Nova Resource:Tools/SAL: Difference between revisions

=== 2020-10-14 ===
=== 2023-06-01 ===
* 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 10:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|7e57832}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 09:21 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|0f4076a}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
* 09:18 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 20:31 bd808: Deployed toollabs-webservice v0.74
* 07:52 dcaro: rebooted tools-package-builder-04 (stuck not letting me log in with my user)
* 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
* 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
* 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
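
The depool/repool entries above (2020-10-14) bracket host maintenance on grid exec and webgrid nodes. As a minimal sketch, assuming plain gridengine commands run on the grid master rather than whatever wrapper tooling was actually used, depooling and repooling one node looks roughly like this:
<syntaxhighlight lang="bash">
# Sketch only: disable/re-enable all queue instances on one grid node around
# maintenance. Host name is an example taken from the 21:00 entry.
NODE=tools-sgewebgrid-generic-0901.tools.eqiad.wmflabs

# Depool: stop new jobs from being scheduled onto the host.
sudo qmod -d "*@${NODE}"

# Check what is still running there before rebooting/migrating the VM.
qhost -h "${NODE}" -j

# ... perform the maintenance (reboot, Ceph migration, flavor change, ...) ...

# Repool: allow scheduling again once the host is healthy.
sudo qmod -e "*@${NODE}"
</syntaxhighlight>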


=== 2020-10-10 ===
=== 2023-05-31 ===
* 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again
* 02:38 andrewbogott: rebooted tools-sgeweblight-10-16,  [[phab:T337806|T337806]]


=== 2020-10-08 ===
=== 2023-05-30 ===
* 17:07 bstorm: rebuilding docker images with locales-all [[phab:T263339|T263339]]
* 00:22 andrewbogott: rebooted tools-sgeweblight-10-30,  oom
* 00:16 andrewbogott: rebooted tools-sgeweblight-10-24, seems to be oom


=== 2020-10-06 ===
=== 2023-05-26 ===
* 19:04 andrewbogott: uncordoned tools-k8s-worker-38
* 13:13 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 18:51 andrewbogott: uncordoned tools-k8s-worker-52
* 12:59 dcaro: rebooting tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud for stale NFS handles (D processes)
* 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration


=== 2020-10-02 ===
=== 2023-05-24 ===
* 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
* 12:28 dcaro: deploy latest buildservice ([[phab:T335865|T335865]])
* 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
* 12:28 dcaro: deploy latest buildservice ([[phab:T336050|T336050]])
* 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk


=== 2020-10-01 ===
=== 2023-05-23 ===
* 21:39 andrewbogott: migrating tools-proxy-06 to ceph
* 14:40 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|0c7b25b}}) - cookbook ran by fran@wmf3169
* 21:35 andrewbogott: moving  k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow


=== 2020-09-30 ===
=== 2023-05-22 ===
* 18:34 andrewbogott: repooling tools-sgeexec-0918
* 10:06 arturo: hard-reboot tools-sgeexec-10-18 (monitoring reporting it as down)
* 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036


=== 2020-09-23 ===
=== 2023-05-19 ===
* 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install
* 13:38 arturo: uncordon tools-k8s-worker-47/48/64/75
* 08:46 bd808: Building new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images ([[phab:T323522|T323522]], [[phab:T320904|T320904]])


=== 2020-09-18 ===
=== 2023-05-17 ===
* 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
* 16:05 dcaro: release toolforge-cli 0.3.0 ([[phab:T336225|T336225]])
* 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
* 12:48 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|fa8ed2c}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 12:48 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 12:45 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|d1bb238}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
* 12:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|8d21314}}) - cookbook ran by dcaro@vulcanus
* 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
* 10:54 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:7199a9e from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by fran@wmf3169
* 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 08:49 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 08:33 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
* 08:32 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916  for flavor update
* 08:25 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  after flavor update
* 08:17 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  for flavor update
* 08:10 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  after flavor update
* 08:03 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  for flavor update
* 07:54 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
* 07:46 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
* 07:45 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:42 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:29 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus


=== 2020-09-17 ===
=== 2023-05-16 ===
* 21:56 bd808: Built and deployed tools-manifest v0.22 ([[phab:T263190|T263190]])
* 23:24 bd808: kubectl uncordon tools-k8s-worker-69
* 21:55 bd808: Built and deployed tools-manifest v0.22 ([[phab:T169695|T169695]])
* 23:22 bd808: Force reboot tools-k8s-worker-69 via Horizon
* 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 ([[phab:T263190|T263190]])
* 23:18 bd808: kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-69
* 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
* 23:17 bd808: kubectl cordon tools-k8s-worker-69
* 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
* 14:37 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:35b57c6 from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|35b57c6}}) - cookbook ran by dcaro@vulcanus
* 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
* 13:05 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|df52a39}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 12:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ad5b2b5}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 11:52 dcaro: release toolforge-weld 0.2.0 and toolforge-webservice 0.98
* 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
* 08:08 dcaro: reboot tools-mail-03 ([[phab:T316544|T316544]])
* 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
* 08:07 dcaro: reboot tools-sgebastion-10 ([[phab:T316544|T316544]])
* 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
* 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
* 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
* 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
* 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph
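
The 2023-05-16 entries for tools-k8s-worker-69 above (23:17 through 23:24), read in chronological order, are the usual Kubernetes node-maintenance sequence. A sketch, with the reboot itself done out-of-band via Horizon as the log notes:
<syntaxhighlight lang="bash">
# Keep new pods off the node, then evict what is already running there.
kubectl cordon tools-k8s-worker-69
kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-69

# ... force-reboot the instance via Horizon and wait for it to rejoin ...

# Let the scheduler use the node again.
kubectl uncordon tools-k8s-worker-69
</syntaxhighlight>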


=== 2020-09-16 ===
=== 2023-05-15 ===
* 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 22:50 bd808: Rebuilding bullseye and buster docker containers to pick up make package addition ([[phab:T320343|T320343]])
* 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 22:09 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 22:07 wm-bot2: rebooted k8s node tools-k8s-worker-65 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 22:06 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master
* 22:04 wm-bot2: rebooted k8s node tools-k8s-worker-62 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:02 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:58 wm-bot2: rebooted k8s node tools-k8s-worker-60 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:56 wm-bot2: rebooted k8s node tools-k8s-worker-59 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:54 wm-bot2: rebooted k8s node tools-k8s-worker-58 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:52 wm-bot2: rebooted k8s node tools-k8s-worker-57 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:51 wm-bot2: rebooted k8s node tools-k8s-worker-56 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:50 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:49 wm-bot2: rebooted k8s node tools-k8s-worker-54 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:47 wm-bot2: rebooted k8s node tools-k8s-worker-53 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:44 wm-bot2: rebooted k8s node tools-k8s-worker-52 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:42 wm-bot2: rebooted k8s node tools-k8s-worker-51 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:41 wm-bot2: rebooted k8s node tools-k8s-worker-50 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:40 wm-bot2: rebooted k8s node tools-k8s-worker-49 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:38 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:37 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:33 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by andrew@bullseye
* 21:16 wm-bot2: rebooted k8s node tools-k8s-worker-45 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:15 wm-bot2: rebooted k8s node tools-k8s-worker-44 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:13 wm-bot2: rebooted k8s node tools-k8s-worker-43 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:12 wm-bot2: rebooted k8s node tools-k8s-worker-42 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:09 wm-bot2: rebooted k8s node tools-k8s-worker-41 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:03 wm-bot2: rebooted k8s node tools-k8s-worker-40 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:52 wm-bot2: rebooted k8s node tools-k8s-worker-38 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:50 wm-bot2: rebooted k8s node tools-k8s-worker-37 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:49 wm-bot2: rebooted k8s node tools-k8s-worker-36 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:48 wm-bot2: rebooted k8s node tools-k8s-worker-35 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:47 wm-bot2: rebooted k8s node tools-k8s-worker-34 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:42 wm-bot2: rebooted k8s node tools-k8s-worker-33 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:41 andrewbogott: rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
* 20:36 wm-bot2: rebooted k8s node tools-k8s-worker-32 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:32 wm-bot2: rebooted k8s node tools-k8s-worker-31 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:24 wm-bot2: rebooted k8s node tools-k8s-worker-30 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 19:04 wm-bot2: rebooted k8s node tools-k8s-worker-67 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:56 wm-bot2: rebooted k8s node tools-k8s-worker-68 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:49 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:46 bd808: Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
* 18:44 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:42 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:39 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:34 wm-bot2: rebooted k8s node tools-k8s-worker-73 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:28 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 taavi: clear mail queue
* 18:21 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:15 wm-bot2: rebooted k8s node tools-k8s-worker-77 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:08 wm-bot2: rebooted k8s node tools-k8s-worker-80 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:06 wm-bot2: rebooted k8s node tools-k8s-worker-81 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:05 wm-bot2: rebooted k8s node tools-k8s-worker-82 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:57 wm-bot2: rebooted k8s node tools-k8s-worker-83 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:48 wm-bot2: rebooted k8s node tools-k8s-worker-84 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:47 wm-bot2: rebooted k8s node tools-k8s-worker-85 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:38 wm-bot2: rebooted k8s node tools-k8s-worker-86 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:37 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:35 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:34 wm-bot2: rebooting all the workers of tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:20 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:19 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:17 bd808: Rebuilding bullseye and buster docker containers to pick up openssh-client package addition ([[phab:T258841|T258841]])
* 17:12 wm-bot2: rebooting the whole tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:06 dcaro: rebooting tools-sgegrid-shadow ([[phab:T316544|T316544]])
* 17:00 dcaro: rebooting tools-sgegrid-master ([[phab:T316544|T316544]])
* 16:55 dcaro: rebooting tools-sgeexec-10-20 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-18 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-25 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-20 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeweblight-10-21 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeexec-10-22 ([[phab:T316544|T316544]])
* 16:51 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T316544|T316544]])
* 16:50 dcaro: rebooting tools-sgeexec-10-17 ([[phab:T316544|T316544]])
* 16:48 dcaro: rebooting tools-sgeexec-10-21 ([[phab:T316544|T316544]])
* 16:47 dcaro: rebooting tools-sgeexec-10-19 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeexec-10-8 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeweblight-10-24 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgewebgen-10-2 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgeweblight-10-16 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeweblight-10-30 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeexec-10-18 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-16 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-14 ([[phab:T316544|T316544]])
* 16:41 dcaro: rebooting tools-sgeweblight-10-32 ([[phab:T316544|T316544]])
* 16:40 dcaro: rebooting tools-sgeweblight-10-22 ([[phab:T316544|T316544]])
* 16:39 dcaro: rebooting tools-sgeweblight-10-17 ([[phab:T316544|T316544]])
* 16:32 dcaro: rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T316544|T316544]])
* 16:23 dcaro: rebooting tools-sgeweblight-10-26 ([[phab:T316544|T316544]])
* 16:15 bd808: Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
* 16:14 arturo: rebooted a bunch of nodes to cleanup D procs and high load avg because NFS outage (result of [[phab:T316544|T316544]])
* 12:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:09f3b49-dev from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|32a8ae9}}) - cookbook ran by dcaro@vulcanus
* 09:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:c64da5a from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|c64da5a}}) - cookbook ran by dcaro@vulcanus


=== 2020-09-10 ===
=== 2023-05-13 ===
* 15:37 arturo: hard-rebooting tools-proxy-05
* 09:13 taavi: reboot tools-sgeexec-10-15,17,18,21
* 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
* 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
* 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster ([[phab:T250172|T250172]])


=== 2020-09-09 ===
=== 2023-05-11 ===
* 11:12 arturo: new ingress nodes added to the cluster, and tainted/labeled per the docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#ingress_nodes ([[phab:T250172|T250172]])
* 15:48 bd808: Rebooted tools-sgebastion-10 for [[phab:T336510|T336510]]
* 10:50 arturo: created puppet prefix `tools-k8s-ingress` ([[phab:T250172|T250172]])
* 15:31 bd808: Sent `wall` for reboot of tools-sgebastion-10 circa 15:40Z
* 10:42 arturo: created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the `tools-ingress` server group [[phab:T250172|T250172]])
* 10:38 arturo: created server group `tools-ingress` with soft anti affinity policy ([[phab:T250172|T250172]])


=== 2020-09-08 ===
=== 2023-05-09 ===
* 23:24 bstorm: clearing grid queue error states blocking job runs
* 16:36 taavi: delegated beta.toolforge.org domain to toolsbeta per [[phab:T257386|T257386]]
* 22:53 bd808: forcing puppet run on tools-sgebastion-07
* 09:35 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|ad4fa2a}}) - cookbook ran by taavi@runko


=== 2020-09-02 ===
=== 2023-05-08 ===
* 18:13 andrewbogott: moving tools-sgeexec-0920  to ceph
* 09:12 arturo: force-reboot tools-sgeexec-10-13 (reported as down by the monitoring, no SSH)
* 17:57 andrewbogott: moving tools-sgeexec-0942  to ceph


=== 2020-08-31 ===
=== 2023-05-07 ===
* 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
* 16:06 taavi: remove inbound 25/tcp rule from the toolserver legacy server [[phab:T136225|T136225]]
* 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
* 17:19 andrewbogott: repooled tools-sgeexec-0901
* 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log [[phab:T261677|T261677]]
* 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there [[phab:T261677|T261677]]
* 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph


=== 2020-08-30 ===
=== 2023-05-05 ===
* 00:57 Krenair: also ran qconf -ds on each
* 22:21 bd808: Added "RepoLookoutBot" to hiera key "dynamicproxy::blocked_user_agent_regex" to stop unnecessary scans by https://www.repo-lookout.org/
* 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
* 22:20 bd808: Added
* 11:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:811164e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|811164e}}) - cookbook ran by taavi@runko
* 09:13 dcaro: rebooted tools-sgeexec-10-16 as it was stuck ([[phab:T335009|T335009]])
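
The 2020-08-30 entries above describe cleaning up gridengine after exec nodes were deleted. A per-node sketch of that cleanup, run on the grid master (the hostgroup edit via `qconf -dattr` is an assumption; the qmod/qdel/qconf commands are the ones named in the log):
<syntaxhighlight lang="bash">
NODE=tools-sgeexec-0921.tools.eqiad.wmflabs   # example host from the log

sudo qconf -dattr hostgroup hostlist "$NODE" @general   # drop the host from @general (assumed method)
qhost -h "$NODE" -j                                     # list jobs still registered on the dead node
sudo qmod -rj <jobid>                                   # reschedule a job elsewhere ...
sudo qdel -f <jobid>                                    # ... or force-delete it if it will not move
sudo qconf -de "$NODE"                                  # remove the execution host entry
sudo qconf -ds "$NODE"                                  # remove it as a submit host as well
</syntaxhighlight>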


=== 2020-08-29 ===
=== 2023-05-04 ===
* 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
* 15:15 wm-bot2: removed instance tools-k8s-etcd-15 - cookbook ran by andrew@bullseye
* 16:00 bstorm: deleting  "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"
* 14:13 wm-bot2: removed instance tools-k8s-etcd-14 - cookbook ran by andrew@bullseye


=== 2020-08-26 ===
=== 2023-05-03 ===
* 21:08 bd808: Disabled puppet on tools-proxy-06 to test fixes for a bug in the new [[phab:T251628|T251628]] code
* 12:41 wm-bot2: removed instance tools-k8s-etcd-13 - cookbook ran by andrew@bullseye
* 08:54 arturo: merged several patches by bryan for toolforge front proxy (cleanups, etc) example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622435


=== 2020-08-25 ===
=== 2023-05-02 ===
* 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
* 00:29 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by raymond@ubuntu
* 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
* 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)


=== 2020-08-19 ===
=== 2023-05-01 ===
* 21:29 andrewbogott: shutting down and removing  tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
* 23:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:3b3803f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|3b3803f}}) - cookbook ran by raymond@ubuntu
* 21:15 andrewbogott: shutting down and removing  tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
* 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79


=== 2020-08-18 ===
=== 2023-04-28 ===
* 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages
* 15:01 arturo: force reboot tools-k8s-worker-79, unresponsive
* 08:27 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T335336|T335336]])
* 07:20 dcaro: rebooting tools-sgegrid-shadow due to stale nfs mount
* 00:09 bd808: `kubectl uncordon tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 00:07 bd808: Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon ([[phab:T335543|T335543]])
* 00:04 bd808: Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud ([[phab:T335543|T335543]])


=== 2020-07-30 ===
=== 2023-04-27 ===
* 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66.  [[phab:T258663|T258663]]
* 23:59 bd808: `kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 20:50 bd808: Started process to rebuild all buster and bullseye based container images again. Prior problem seems to have been stale images in local cache on the build server.
* 20:42 bd808: Container image rebuild failed with GPG errors in buster-sssd base image. Will investigate and attempt to restart once resolved in a local dev environment.
* 20:33 bd808: Started process to rebuild all buster and bullseye based container images per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images


=== 2020-07-29 ===
=== 2023-04-18 ===
* 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away
* 16:46 dcaro: force-rebooting tools-sgeweblight-10-25/26/27 as they got stuck stopping the grid_exec process
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-27 due to stuck exec daemon not releasing port 6445
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-25 due to stuck exec daemon not releasing port 6445
* 16:32 dcaro: rebooting root@tools-sgeweblight-10-26 due to stuck exec daemon not releasing port 6445
* 16:26 dcaro: rebooting root@tools-sgeexec-10-14 due to stuck exec daemon not releasing port 6445


=== 2020-07-24 ===
=== 2023-04-17 ===
* 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
* 13:10 dcaro: rebooting tools-sgegrid-master node ([[phab:T334847|T334847]])
* 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
* 02:43 legoktm: manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)
* 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
* 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org
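
The 20:17 entry above forces a garbage-collection pass on docker-registry.tools.wmflabs.org after image deletions. A hypothetical sketch, assuming the stock "distribution" registry binary, its default config path and a systemd unit named docker-registry (the real host may package and run the registry differently):
<syntaxhighlight lang="bash">
sudo systemctl stop docker-registry                                    # keep writes out during GC (assumed unit name)
sudo docker-registry garbage-collect /etc/docker/registry/config.yml   # reclaim blobs left behind by deleted manifests
sudo systemctl start docker-registry
</syntaxhighlight>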


=== 2020-07-22 ===
=== 2023-04-11 ===
* 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary [[phab:T258663|T258663]]
* 16:11 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|b65439b}}) - cookbook ran by arturo@nostromo
* 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] [[phab:T257945|T257945]]
* 15:46 arturo: upload toolforge-jobs-framework-cli v11 to aptly
* 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] [[phab:T257945|T257945]]
* 14:17 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller.git ({{Gerrit|d878e49}}) ([[phab:T324834|T324834]]) - cookbook ran by dcaro@vulcanus
* 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] [[phab:T257945|T257945]]
* 13:19 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c6c693c from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c6c693c}}) - cookbook ran by arturo@nostromo
* 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] [[phab:T257945|T257945]]
* 12:09 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:40bd3b3 from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|40bd3b3}}) - cookbook ran by dcaro@vulcanus
* 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once [[phab:T257945|T257945]]
* 10:34 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|9aed7e5}}) - cookbook ran by taavi@runko
* 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 [[phab:T257945|T257945]]
* 09:15 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ({{Gerrit|c6a3e29}}) ([[phab:T329677|T329677]]) - cookbook ran by taavi@runko
* 22:15 bstorm: set the tools-k8s-control nodes to also use 800MBps to prevent issues with toolforge ingress and api system
* 08:45 wm-bot2: Adding a new k8s worker node - cookbook ran by taavi@runko
* 22:07 bstorm: set the tools-k8s-haproxy-1 (main load balancer for toolforge) to have an egress limit of 800MB per sec instead of the same as all the other servers
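
The 2020-07-22 entries above roll out the NFS 4.2 mount change in batches (puppet disabled fleet-wide first, then re-enabled and the share remounted group by group). A per-node sketch, assuming the new mount options come from puppet and the share is the /data/project export (paths are illustrative):
<syntaxhighlight lang="bash">
sudo puppet agent --enable
sudo run-puppet-agent                                        # writes the vers=4.2 options into the mount config
sudo umount /data/project || sudo umount -l /data/project    # NFS version changes need a full remount
sudo mount /data/project
mount | grep ' /data/project '                               # confirm vers=4.2 is now in effect
</syntaxhighlight>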


=== 2020-07-21 ===
=== 2023-04-10 ===
* 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
* 10:46 taavi: patch existing PSP roles to use policy/v1beta1 [[phab:T331619|T331619]]
* 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'
* 09:16 arturo: upgrading k8s cluster to 1.22 ([[phab:T286856|T286856]])


=== 2020-07-17 ===
=== 2023-04-07 ===
* 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test ([[phab:T102367|T102367]])
* 14:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-3 ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually ([[phab:T102367|T102367]])
* 14:30 wm-bot2: removed instance tools-k8s-control-2 - cookbook ran by taavi@runko


=== 2020-07-15 ===
=== 2023-04-05 ===
* 23:11 bd808: Removed ssh root key for valhallasw from project hiera ([[phab:T255697|T255697]])
* 15:16 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|5ea5992}}) - cookbook ran by taavi@runko
* 15:10 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3569803 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3569803}}) - cookbook ran by taavi@runko
* 14:56 wm-bot2: Added a new k8s worker tools-k8s-worker-88.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Added a new k8s worker tools-k8s-worker-87.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Added a new k8s worker tools-k8s-worker-86.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Added a new k8s worker tools-k8s-worker-85.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Added a new k8s worker tools-k8s-worker-84.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Added a new k8s worker tools-k8s-worker-83.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:34 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:33 wm-bot2: removed instance tools-k8s-worker-83 - cookbook ran by taavi@runko
* 13:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:06 wm-bot2: removing grid node tools-sgeweblight-10-31.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:02 wm-bot2: removing grid node tools-sgeweblight-10-29.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:00 wm-bot2: removing grid node tools-sgeexec-10-9.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:58 wm-bot2: removing grid node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:54 wm-bot2: removing grid node tools-sgeexec-10-7.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:52 wm-bot2: removing grid node tools-sgeweblight-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-1 - cookbook ran by taavi@runko
* 12:07 wm-bot2: Added a new k8s control tools-k8s-control-6.tools.eqiad1.wikimedia.cloud to the cluster - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s control node - cookbook ran by taavi@runko
* 11:51 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:39 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:38 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:21 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:21 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:09 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:53 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:41 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:41 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:16 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko


=== 2020-07-09 ===
=== 2023-04-04 ===
* 18:53 bd808: Updating git-review to 1.27 via clush across cluster ([[phab:T257496|T257496]])
* 19:00 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:59 wm-bot2: removed instance tools-k8s-control-5 - cookbook ran by taavi@runko
* 18:46 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:45 wm-bot2: Adding a new k8s CONTROL node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:15 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:28 arturo: hard-reboot the 3 k8s control nodes


=== 2020-07-08 ===
=== 2023-04-03 ===
* 11:16 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 -- important change to front-proxy ([[phab:T234617|T234617]])
* 17:13 wm-bot2: rebooted k8s node tools-k8s-worker-31 - cookbook ran by taavi@runko
* 11:11 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 ([[phab:T234617|T234617]])
* 17:11 wm-bot2: rebooted k8s node tools-k8s-worker-32 - cookbook ran by taavi@runko
* 17:09 wm-bot2: rebooted k8s node tools-k8s-worker-33 - cookbook ran by taavi@runko
* 17:07 wm-bot2: rebooted k8s node tools-k8s-worker-34 - cookbook ran by taavi@runko
* 17:05 wm-bot2: rebooted k8s node tools-k8s-worker-35 - cookbook ran by taavi@runko
* 17:04 wm-bot2: rebooted k8s node tools-k8s-worker-36 - cookbook ran by taavi@runko
* 17:02 wm-bot2: rebooted k8s node tools-k8s-worker-37 - cookbook ran by taavi@runko
* 17:00 wm-bot2: rebooted k8s node tools-k8s-worker-38 - cookbook ran by taavi@runko
* 16:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 - cookbook ran by taavi@runko
* 16:56 wm-bot2: rebooted k8s node tools-k8s-worker-40 - cookbook ran by taavi@runko
* 16:55 wm-bot2: rebooted k8s node tools-k8s-worker-41 - cookbook ran by taavi@runko
* 16:53 wm-bot2: rebooted k8s node tools-k8s-worker-42 - cookbook ran by taavi@runko
* 16:51 wm-bot2: rebooted k8s node tools-k8s-worker-43 - cookbook ran by taavi@runko
* 16:49 wm-bot2: rebooted k8s node tools-k8s-worker-44 - cookbook ran by taavi@runko
* 16:45 wm-bot2: rebooted k8s node tools-k8s-worker-45 - cookbook ran by taavi@runko
* 16:43 wm-bot2: rebooted k8s node tools-k8s-worker-46 - cookbook ran by taavi@runko
* 16:41 wm-bot2: rebooted k8s node tools-k8s-worker-47 - cookbook ran by taavi@runko
* 16:40 wm-bot2: rebooted k8s node tools-k8s-worker-48 - cookbook ran by taavi@runko
* 16:38 wm-bot2: rebooted k8s node tools-k8s-worker-49 - cookbook ran by taavi@runko
* 16:36 wm-bot2: rebooted k8s node tools-k8s-worker-50 - cookbook ran by taavi@runko
* 16:35 wm-bot2: rebooted k8s node tools-k8s-worker-51 - cookbook ran by taavi@runko
* 16:33 wm-bot2: rebooted k8s node tools-k8s-worker-52 - cookbook ran by taavi@runko
* 16:31 wm-bot2: rebooted k8s node tools-k8s-worker-53 - cookbook ran by taavi@runko
* 16:28 wm-bot2: rebooted k8s node tools-k8s-worker-54 - cookbook ran by taavi@runko
* 16:27 wm-bot2: rebooted k8s node tools-k8s-worker-55 - cookbook ran by taavi@runko
* 16:25 wm-bot2: rebooted k8s node tools-k8s-worker-56 - cookbook ran by taavi@runko
* 16:23 wm-bot2: rebooted k8s node tools-k8s-worker-57 - cookbook ran by taavi@runko
* 16:21 wm-bot2: rebooted k8s node tools-k8s-worker-58 - cookbook ran by taavi@runko
* 16:20 wm-bot2: rebooted k8s node tools-k8s-worker-59 - cookbook ran by taavi@runko
* 16:18 wm-bot2: rebooted k8s node tools-k8s-worker-60 - cookbook ran by taavi@runko
* 16:09 wm-bot2: rebooted k8s node tools-k8s-worker-61 - cookbook ran by taavi@runko
* 16:07 wm-bot2: rebooted k8s node tools-k8s-worker-62 - cookbook ran by taavi@runko
* 16:01 wm-bot2: rebooted k8s node tools-k8s-worker-64 - cookbook ran by taavi@runko
* 16:00 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:58 wm-bot2: rebooted k8s node tools-k8s-worker-65 - cookbook ran by taavi@runko
* 15:56 wm-bot2: rebooted k8s node tools-k8s-worker-66 - cookbook ran by taavi@runko
* 15:48 wm-bot2: rebooted k8s node tools-k8s-worker-67 - cookbook ran by taavi@runko
* 15:38 wm-bot2: rebooted k8s node tools-k8s-worker-68 - cookbook ran by taavi@runko
* 15:36 wm-bot2: rebooted k8s node tools-k8s-worker-69 - cookbook ran by taavi@runko
* 15:34 wm-bot2: rebooted k8s node tools-k8s-worker-70 - cookbook ran by taavi@runko
* 15:32 wm-bot2: rebooted k8s node tools-k8s-worker-71 - cookbook ran by taavi@runko
* 15:30 wm-bot2: rebooted k8s node tools-k8s-worker-72 - cookbook ran by taavi@runko
* 15:28 wm-bot2: rebooted k8s node tools-k8s-worker-73 - cookbook ran by taavi@runko
* 15:26 wm-bot2: rebooted k8s node tools-k8s-worker-74 - cookbook ran by taavi@runko
* 15:24 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:22 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:17 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:14 wm-bot2: rebooted k8s node tools-k8s-worker-76 - cookbook ran by taavi@runko
* 15:12 wm-bot2: rebooted k8s node tools-k8s-worker-77 - cookbook ran by taavi@runko
* 15:10 wm-bot2: rebooted k8s node tools-k8s-worker-78 - cookbook ran by taavi@runko
* 15:08 wm-bot2: rebooted k8s node tools-k8s-worker-79 - cookbook ran by taavi@runko
* 15:06 wm-bot2: rebooted k8s node tools-k8s-worker-80 - cookbook ran by taavi@runko
* 14:59 wm-bot2: rebooted k8s node tools-k8s-worker-81 - cookbook ran by taavi@runko
* 14:41 wm-bot2: rebooted k8s node tools-k8s-worker-82 - cookbook ran by taavi@runko
* 14:38 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 14:13 andrewbogott: test log to see if stashbot is back working
* 13:19 andrewbogott: forcing puppet run on all toolforge VMs
* 08:28 taavi: stop exim4.service on tools-sgecron-2 [[phab:T333477|T333477]]
* 06:52 taavi: stop jobs-framework-emailer to prevent spam due to NFS being read-only [[phab:T333477|T333477]]


=== 2020-07-07 ===
=== 2023-03-29 ===
* 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 16:07 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|dc26f52}}) - cookbook ran by raymond@ubuntu
* 23:19 bd808: Deploying webservice v0.73 via clush ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 15:21 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:24115c7 from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|24115c7}}) - cookbook ran by raymond@ubuntu
* 23:16 bd808: Building webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
* 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
* 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) ([[phab:T247236|T247236]])


=== 2020-07-06 ===
=== 2023-03-28 ===
* 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 ([[phab:T247236|T247236]])
* 19:43 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|e1b9815}}) - cookbook ran by raymond@ubuntu
* 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector ([[phab:T247236|T247236]])


=== 2020-07-01 ===
=== 2023-03-27 ===
* 11:19 arturo: cleanup exim email queue (4 frozen messages)
* 22:51 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:70d550a from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|70d550a}}) - cookbook ran by raymond@ubuntu
* 11:02 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/608849 ([[phab:T256737|T256737]])
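
The 11:19 entry above (2020-07-01) clears frozen messages out of the exim queue on the mail relay; with the standard exim4 queue tools that is roughly:
<syntaxhighlight lang="bash">
sudo exim -bp | grep -c frozen                # count frozen messages in the queue
sudo exiqgrep -z -i                           # list just the IDs of the frozen messages
sudo exiqgrep -z -i | xargs sudo exim -Mrm    # remove them from the queue
</syntaxhighlight>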


=== 2020-06-30 ===
=== 2023-03-26 ===
* 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` ([[phab:T256737|T256737]])
* 20:28 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko


=== 2020-06-29 ===
=== 2023-03-24 ===
* 22:48 legoktm: built html-sssd/web image ([[phab:T241817|T241817]])
* 14:13 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance
* 22:23 legoktm: rebuild python<nowiki>{</nowiki>34,35,37<nowiki>}</nowiki>-sssd/web images for https://gerrit.wikimedia.org/r/608093
* 12:01 arturo: introduced spam filter in the mail server ([[phab:T120210|T120210]])


=== 2020-06-25 ===
=== 2023-03-21 ===
* 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 [[phab:T256426|T256426]]
* 08:11 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings [[phab:T256426|T256426]]
* 21:24 bstorm: hard rebooting tools-sgebastion-09


=== 2020-06-24 ===
=== 2023-03-20 ===
* 12:36 arturo: live-hacking puppetmaster with exim prometheus stuff ([[phab:T175964|T175964]])
* 13:39 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 11:57 arturo: merging email ratelimiting patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320 ([[phab:T175964|T175964]])
* 10:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance


=== 2020-06-23 ===
=== 2023-03-19 ===
* 17:55 arturo: killed procs for users `hamishz` and `msyn` which apparently were tools that should be running in the grid / kubernetes instead
* 09:32 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera


=== 2020-06-17 ===
=== 2023-03-17 ===
* 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix ([[phab:T247236|T247236]], [[phab:T234617|T234617]])
* 15:56 andrewbogott: truncating .out, .err, and .log files to 10MB in anticipation of moving the NFS volumes
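
The 15:56 entry above caps grid job output files at 10MB before the NFS volume move. A sketch of one way to do it (the path is the usual Toolforge tool-home export; the exact invocation used is an assumption):
<syntaxhighlight lang="bash">
sudo find /data/project -type f \
    \( -name '*.out' -o -name '*.err' -o -name '*.log' \) \
    -size +10M -print -exec truncate -s 10M {} +
</syntaxhighlight>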


=== 2020-06-16 ===
=== 2023-03-13 ===
* 23:01 bd808: Building new Docker images to pick up webservice 0.72
* 09:50 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:f90bd8f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|f90bd8f}}) - cookbook ran by dcaro@vulcanus
* 22:58 bd808: Deploying webservice 0.72 to bastions and grid
* 22:56 bd808: Building webservice 0.72
* 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898


=== 2020-06-15 ===
=== 2023-03-12 ===
* 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions [[phab:T157792|T157792]]
* 13:40 taavi: restart haproxy on tools-k8s-haproxy-3
* 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 ([[phab:T254640|T254640]], [[phab:T253412|T253412]])
* 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
* 18:05 bd808: Building webservice 0.71


=== 2020-06-12 ===
=== 2023-03-11 ===
* 13:13 arturo: live-hacking session in the puppetmaster ended
* 18:38 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing PAWS related patch (they share haproxy puppet code)
* 18:36 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01
* 18:34 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:31 taavi: reboot misbehaving tools-sgeexec-10-11


=== 2020-06-11 ===
=== 2023-03-10 ===
* 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough
* 16:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|8b42b15}}) - cookbook ran by taavi@runko


=== 2020-06-04 ===
=== 2023-03-09 ===
* 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*
* 10:13 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|53e7f81}}) - cookbook ran by taavi@runko
* 10:04 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:834807c from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|834807c}}) - cookbook ran by taavi@runko


=== 2020-06-02 ===
=== 2023-03-08 ===
* 12:23 arturo: renewed TLS cert for k8s metrics-server ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#internal_API_access
* 22:31 bd808: Live hacked user-maintainer clusterrole to work around breakage in [[phab:T331572|T331572]]
* 11:00 arturo: renewed TLS cert for prometheus to contact toolforge k8s ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#external_API_access


=== 2020-06-01 ===
=== 2023-03-07 ===
* 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster [[phab:T250874|T250874]]
* 11:34 wm-bot2: Increased quotas by 2 volumes - cookbook ran by fran@wmf3169
* 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
* 11:09 wm-bot2: Increased quotas by 6 snapshots - cookbook ran by fran@wmf3169
* 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition
* 11:07 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169


=== 2020-05-29 ===
=== 2023-03-06 ===
* 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty [[phab:T252217|T252217]]
* 12:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|6688477}}) - cookbook ran by taavi@runko
* 12:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:e916fee from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|e916fee}}) - cookbook ran by taavi@runko
* 12:16 arturo: delete calico deployment, redeploy from https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ([[phab:T328539|T328539]])


=== 2020-05-28 ===
=== 2023-03-05 ===
* 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
* 15:43 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|3e04025}}) - cookbook ran by taavi@runko
* 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 [[phab:T246122|T246122]]
* 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now [[phab:T246122|T246122]]
* 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions [[phab:T246122|T246122]]
* 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 ([[phab:T246122|T246122]])
* 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 ([[phab:T246122|T246122]])
* 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 ([[phab:T246122|T246122]])
* 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 ([[phab:T246122|T246122]])
* 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
* 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 ([[phab:T253816|T253816]])
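
The 2020-05-28 entries above walk the cluster from control nodes to workers for the 1.16.10 upgrade. A rough per-worker sketch following the standard kubeadm procedure (the apt pinning format and the drain flags of that Kubernetes era are assumptions; any wrapper script actually used is not shown in the log):
<syntaxhighlight lang="bash">
# From a control node: move workloads off the worker first.
kubectl drain tools-k8s-worker-30 --ignore-daemonsets --delete-local-data

# On the worker itself:
sudo apt-get install -y kubeadm=1.16.10-00
sudo kubeadm upgrade node                    # refresh the kubelet config for the new version
sudo apt-get install -y kubelet=1.16.10-00 kubectl=1.16.10-00
sudo systemctl restart kubelet

# Back on the control node:
kubectl uncordon tools-k8s-worker-30
</syntaxhighlight>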


=== 2020-05-27 ===
=== 2023-03-02 ===
* 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"
* 11:32 arturo: aborrero@tools-k8s-control-2:~$ sudo -i kubectl apply -f /etc/kubernetes/toolforge-tool-roles.yaml (https://gerrit.wikimedia.org/r/c/operations/puppet/+/889836)


=== 2020-05-26 ===
=== 2023-03-01 ===
* 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta [[phab:T246059|T246059]] [[phab:T211096|T211096]]
* 13:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13eda9d}}) - cookbook ran by taavi@runko
* 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap [[phab:T246122|T246122]]


=== 2020-05-22 ===
=== 2023-02-28 ===
* 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems with 10 min warning
* 17:19 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|9252af7}}) - cookbook ran by taavi@runko
* 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 [[phab:T253412|T253412]]
* 17:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e46da83}}) - cookbook ran by taavi@runko


=== 2020-05-21 ===
=== 2023-02-23 ===
* 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 ([[phab:T252700|T252700]])
* 18:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|efb60b3}}) - cookbook ran by taavi@runko
* 22:36 bd808: Updated tools-webservice to 0.70 across instances ([[phab:T252700|T252700]])
* 09:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/buildpack-admission:b34e2f8 from https://github.com/toolforge/buildpack-admission-controller.git ({{Gerrit|b34e2f8}}) - cookbook ran by taavi@runko
* 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py


=== 2020-05-20 ===
=== 2023-02-21 ===
* 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid ([[phab:T247422|T247422]])
* 09:37 arturo: hard-reboot tools-sgeexec-10-11 (unresponsive to ssh)
* 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'apt-get install tesseract-ocr -t stretch-backports -y'` ([[phab:T247422|T247422]])
* 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` ([[phab:T247422|T247422]])
* 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` ([[phab:T247422|T247422]])


=== 2020-05-19 ===
=== 2023-02-20 ===
* 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such
* 11:24 taavi: redeploy volume-admission with helm and cert-manager certificates [[phab:T329530|T329530]] [[phab:T292238|T292238]]
* 11:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ede8bd0}}) - cookbook ran by taavi@runko
* 11:05 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-volume-admission-controller:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|7fd13ac}}) - cookbook ran by taavi@runko
* 10:39 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 09:20 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2020-05-13 ===
=== 2023-02-19 ===
* 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s [[phab:T250863|T250863]]
* 09:16 taavi: uncordon tools-k8s-worker-[80-82] after fixing security groups [[phab:T329378|T329378]]
* 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade [[phab:T250863|T250863]]


=== 2020-05-09 ===
=== 2023-02-17 ===
* 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera [[phab:T252260|T252260]]
* 11:32 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:31 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|7729b18}}) ([[phab:T254636|T254636]]) - cookbook ran by arturo@endurance
* 11:26 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 11:24 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api ({{Gerrit|618ab29}}) - cookbook ran by arturo@endurance
* 10:25 arturo: build and push mariadb-sssd/base docker image for Toolforge ([[phab:T320178|T320178]], [[phab:T254636|T254636]])


=== 2020-05-08 ===
=== 2023-02-16 ===
* 18:17 bd808: Building all jessie-sssd derived images ([[phab:T197930|T197930]])
* 15:58 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 17:29 bd808: Building new jessie-sssd base image ([[phab:T197930|T197930]])
* 15:30 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager ({{Gerrit|d71994e}}) - cookbook ran by arturo@nostromo
* 13:52 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|7191997}}) - cookbook ran by taavi@runko
* 13:44 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:1fe8ec4 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|1fe8ec4}}) - cookbook ran by taavi@runko
* 12:47 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:e9b9920 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|e9b9920}}) - cookbook ran by taavi@runko
* 10:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
* 09:48 arturo: grid engine had failed over to the shadow server; manually put it back on the normal master, per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Grid#GridEngine_Master

* 09:39 arturo: aborrero@tools-sgegrid-shadow:~$ sudo truncate -s 1G /var/log/syslog (was 17G, full root disk)


=== 2020-05-07 ===
=== 2023-02-15 ===
* 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
* 18:03 taavi: deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/889585/ to increase amount of haproxy max connections
* 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
* 15:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos
* 09:50 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager.git ({{Gerrit|e3f3ce1}}) ([[phab:T329453|T329453]]) - cookbook ran by taavi@runko
* 09:30 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2020-05-06 ===
=== 2023-02-14 ===
* 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] ([[phab:T248702|T248702]])
* 15:07 taavi: import cert-manager components to local docker registry [[phab:T329453|T329453]]
* 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet ([[phab:T248702|T248702]])
* 12:12 arturo: the fixed webservicemonitor is starting a bunch of grid webservices ([[phab:T329611|T329611]])
* 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances ([[phab:T248702|T248702]])
* 12:10 arturo: included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! ([[phab:T329611|T329611]], [[phab:T329467|T329467]], [[phab:T244809|T244809]])
* 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
* 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
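The 17:56–21:20 entries above are the standard worker decommission sequence (cordon, drain, remove from the API, then delete the VM). A rough per-node sketch with illustrative names; the VM shutdown/deletion itself went through Horizon or the OpenStack CLI:
<syntaxhighlight lang="bash">
# Stop new pods from landing on the node, evict what is there, then forget the node
kubectl cordon tools-k8s-worker-16
kubectl drain tools-k8s-worker-16 --ignore-daemonsets --delete-local-data
kubectl delete node tools-k8s-worker-16

# Remove the node from profile::toolforge::k8s::worker_nodes in the
# tools-k8s-haproxy prefix hiera (see the 18:24 entry), then delete the VM
openstack server stop tools-k8s-worker-16
openstack server delete tools-k8s-worker-16
</syntaxhighlight>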


=== 2020-05-05 ===
=== 2023-02-13 ===
* 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
* 16:05 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
* 16:03 taavi: update maintain-kubeusers deployment to use helm
* 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
* 15:05 taavi: deploy jobs-api updates, improving some status messages
* 21:51 bd808: Building 5 new k8s worker nodes ([[phab:T248702|T248702]])
* 15:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13d87c4}}) - cookbook ran by taavi@runko
* 15:00 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:390ed64 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|390ed64}}) - cookbook ran by taavi@runko
* 13:14 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:aac195b from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|aac195b}}) - cookbook ran by taavi@runko


=== 2020-05-04 ===
=== 2023-02-10 ===
* 22:08 bstorm_: deleting tools-elastic-01/2/3 [[phab:T236606|T236606]]
* 15:45 taavi: reboot tools-k8s-worker-82 to troubleshoot network issues
* 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files ([[phab:T250866|T250866]])
* 12:44 wm-bot2: Added a new k8s worker tools-k8s-worker-82.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file ([[phab:T250866|T250866]])
* 12:31 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:29 wm-bot2: Added a new k8s worker tools-k8s-worker-81.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:15 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:44 wm-bot2: removing grid node tools-sgeweblight-10-23.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:42 wm-bot2: removing grid node tools-sgeexec-10-5.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:39 wm-bot2: removing grid node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:26 wm-bot2: removing grid node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:24 wm-bot2: removing grid node tools-sgeexec-10-1.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko


=== 2020-04-29 ===
=== 2023-02-01 ===
* 22:13 bstorm_: running a fixup script after fixing a bug [[phab:T247455|T247455]]
* 16:03 taavi: deployed tools-webservice 0.89
* 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools [[phab:T247455|T247455]]
* 15:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|372037f}}) - cookbook ran by taavi@runko
* 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image [[phab:T247455|T247455]]
* 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge [[phab:T247455|T247455]]


=== 2020-04-28 ===
=== 2023-01-26 ===
* 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta [[phab:T247455|T247455]]
* 15:05 taavi: drain and reboot tools-k8s-worker-74 which seems to have some issues with nfs
* 14:37 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|307f302}}) - cookbook ran by taavi@runko
* 14:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:05966c6 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|05966c6}}) - cookbook ran by taavi@runko


=== 2020-04-23 ===
=== 2023-01-24 ===
* 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.
* 12:04 taavi: deploying toolforge-jobs-framework-cli v10 [[phab:T327775|T327775]]
* 10:07 taavi: publish toolforge-jobs-framework-cli v9


=== 2020-04-21 ===
=== 2023-01-23 ===
* 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 11:31 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d5ae229}}) - cookbook ran by taavi@runko
* 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 11:23 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:d085c50 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d085c50}}) - cookbook ran by taavi@runko
* 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host [[phab:T250869|T250869]]
* 11:17 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|864171a}}) - cookbook ran by taavi@runko


=== 2020-04-20 ===
=== 2023-01-20 ===
* 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 ([[phab:T250625|T250625]])
* 23:24 andrewbogott: truncating logfiles with find . -name '*.err'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 14:47 arturo: added joakino to tools.admin LDAP group
* 21:24 andrewbogott: truncating logfiles with find . -name '*.out'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie [[phab:T236606|T236606]]
* 01:06 andrewbogott: truncating logfiles with find . -name '*.log'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers ([[phab:T250625|T250625]])
* 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
* 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files
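The log-truncation entries above (01:06, 21:24, 23:24) are wiki-escaped; written out plainly, the commands run on the NFS server were of this form, with the size thresholds taken from the log:
<syntaxhighlight lang="bash">
# On the NFS server, shrink any grid job log file over 1G down to 100M
# (run separately for *.log, *.out and *.err, as in the entries above)
find . -name '*.log' -size +1G -exec truncate --size=100M {} \;
find . -name '*.out' -size +1G -exec truncate --size=100M {} \;
find . -name '*.err' -size +1G -exec truncate --size=100M {} \;
</syntaxhighlight>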


=== 2020-04-15 ===
=== 2023-01-19 ===
* 23:20 bd808: Building ruby25-sssd/base and children ([[phab:T141388|T141388]], [[phab:T250118|T250118]])
* 11:46 arturo: `aborrero@tools-k8s-control-1:~$ sudo -i kubectl delete clusterrolebinding jobs-api-psp` (cleanup unused stuff)
* 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 [[phab:T250206|T250206]]


=== 2020-04-14 ===
=== 2023-01-18 ===
* 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers [[phab:T246123|T246123]]
* 15:42 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0ad4c66}}) - cookbook ran by arturo@nostromo
* 18:19 bstorm_: updating the maintain-kubeusers:latest image [[phab:T246123|T246123]]
* 15:29 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:54cc15e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|54cc15e}}) - cookbook ran by arturo@nostromo
* 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 [[phab:T246123|T246123]]


=== 2020-04-10 ===
=== 2023-01-17 ===
* 21:33 bd808: Rebuilding all Docker images for the Kubernetes cluster ([[phab:T249843|T249843]])
* 13:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8cf38a1}}) - cookbook ran by arturo@endurance
* 19:36 bstorm_: after testing deploying toollabs-webservice 0.67 to tools repos [[phab:T249843|T249843]]
* 13:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0d0a882}}) - cookbook ran by arturo@endurance
* 14:53 arturo: live-hacking tools-puppetmaster-02 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587991 for [[phab:T249837|T249837]]
* 13:34 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3a58c1d from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3a58c1d}}) - cookbook ran by arturo@endurance


=== 2020-04-09 ===
=== 2023-01-10 ===
* 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
* 11:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 ([[phab:T219070|T219070]])
* 11:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9514b00 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 11:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0243967}}) - cookbook ran by arturo@endurance
* 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
* 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"


=== 2020-04-08 ===
=== 2023-01-03 ===
* 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 17:17 andrewbogott: find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 23:35 bstorm_: deploy toollabs-webservice v0.66 [[phab:T154504|T154504]] [[phab:T234617|T234617]]


=== 2020-04-07 ===
=== 2022-12-20 ===
* 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and  tools-sgebastion-09
* 09:07 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07
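The recurring "cleaned up grid queue errors on tools-sgegrid-master" entries in this log come from a cookbook; in plain gridengine terms the operation is roughly the following, run as a grid admin on the master (a sketch, not the cookbook's exact implementation):
<syntaxhighlight lang="bash">
# Show queue instances in error state (E) with the reason, then clear the error flag
sudo qstat -f -explain E
sudo qmod -cq '*'
</syntaxhighlight>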


=== 2020-04-06 ===
=== 2022-12-12 ===
* 19:16 bstorm_: deleted tools-redis-1001/2 [[phab:T248929|T248929]]
* 14:36 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2020-04-03 ===
=== 2022-12-09 ===
* 22:40 bstorm_: shut down tools-redis-1001/2 [[phab:T248929|T248929]]
* 07:20 taavi: change the canonical tools-mail external hostname to use mail.tools.wmcloud.org and add valid spf to toolforge.org [[phab:T324809|T324809]]
* 22:32 bstorm_: switch tools-redis-1003 to the active redis server [[phab:T248929|T248929]]
* 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group [[phab:T248929|T248929]]
* 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster [[phab:T248929|T248929]]
* 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster [[phab:T248929|T248929]]
* 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens


=== 2020-03-30 ===
=== 2022-12-05 ===
* 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for [[phab:T248702|T248702]]
* 11:06 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs [[phab:T248702|T248702]]
* 16:56 arturo: dropping `_psl.toolforge.org` TXT record ([[phab:T168677|T168677]])


=== 2020-03-27 ===
=== 2022-11-30 ===
* 21:22 bstorm_: removed puppet prefix tools-docker-builder [[phab:T248703|T248703]]
* 10:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|bc3529d}}) - cookbook ran by arturo@nostromo
* 21:15 bstorm_: deleted tools-docker-builder-06 [[phab:T248703|T248703]]
* 10:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c360d54 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c360d54}}) - cookbook ran by arturo@nostromo
* 18:55 bstorm_: launching tools-docker-imagebuilder-01 [[phab:T248703|T248703]]
* 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some test interactions with the API from Python


=== 2020-03-24 ===
=== 2022-11-29 ===
* 11:44 arturo: trying to solve a rebase/merge conflict in labs/private.git in tools-puppetmaster-02
* 19:52 taavi: clear puppet failure emails from exim queues
* 11:33 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]]) (second try with some additional bits in LUA)
* 10:16 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]])


=== 2020-03-18 ===
=== 2022-11-09 ===
* 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
* 08:58 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 18:04 bstorm_: removed puppet prefix tools-flannel-etcd [[phab:T246689|T246689]]
* 17:58 bstorm_: removed puppet prefix tools-worker [[phab:T246689|T246689]]
* 17:57 bstorm_: removed puppet prefix tools-k8s-master [[phab:T246689|T246689]]
* 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster [[phab:T246689|T246689]]
* 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" [[phab:T246689|T246689]]


=== 2020-03-17 ===
=== 2022-11-05 ===
* 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 ([[phab:T219070|T219070]])
* 19:28 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 [[phab:T246689|T246689]]
* 13:26 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;


=== 2020-03-16 ===
=== 2022-11-04 ===
* 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 [[phab:T246689|T246689]]
* 20:41 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 22:00 bstorm_: shut off tools-k8s-master-01 [[phab:T246689|T246689]]
* 14:02 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 [[phab:T246689|T246689]]
* 12:20 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d464be4}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo
* 12:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:2b800f5 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|2b800f5}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo


=== 2020-03-11 ===
=== 2022-11-01 ===
* 17:00 jeh: clean up apt cache on tools-sgebastion-07
* 09:37 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T322110|T322110]]) - cookbook ran by dcaro@vulcanus


=== 2020-03-06 ===
=== 2022-10-26 ===
* 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names
* 08:45 dcaro: depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again


=== 2020-03-03 ===
=== 2022-10-25 ===
* 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) [[phab:T236606|T236606]]
* 16:14 wm-bot2: Increased quotas by 5120 gigabytes - cookbook ran by fran@wmf3169
* 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud [[phab:T236606|T236606]]
* 15:26 dcaro: pushed a newer docker-registry.tools.wmflabs.org/python:3.9-slim-bullseye (from upstream python:3.9-slim-bullseye)
* 17:31 jeh: create a OpenStack virtual ip address for the new elasticsearch cluster [[phab:T236606|T236606]]
* 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) ([[phab:T246689|T246689]])
* 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 ([[phab:T246689|T246689]])


=== 2020-03-02 ===
=== 2022-10-20 ===
* 22:26 jeh: starting first pass of elasticsearch data migration to new cluster [[phab:T236606|T236606]]
* 16:54 andrewbogott: rebooting tools-package-builder-04
* 16:49 andrewbogott: rebooting redis nodes (one at a time)
* 10:54 taavi: rebuild mono68-sssd image with the expired DST Root CA X3 removed [[phab:T311466|T311466]]


=== 2020-03-01 ===
=== 2022-10-18 ===
* 01:48 bstorm_: old version of kubectl removed. Anyone who needs it can download it with `curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.4.12/bin/linux/amd64/kubectl`
* 11:52 taavi: deploy toolforge-jobs-framework-cli deb v8
* 01:27 bstorm_: running the force-migrate command to make sure any new kubernetes deployments are on the new cluster.
* 10:30 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo
* 10:27 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9be2272 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|9be2272}}) - cookbook ran by taavi@runko
* 10:18 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo


=== 2020-02-28 ===
=== 2022-10-17 ===
* 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
* 07:25 taavi: push updated perl532 images [[phab:T320824|T320824]]
* 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
* 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
* 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
* 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
* 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
* 15:28 jeh: create OpenStack server group tools-elastic with anti-affinity policy enabled [[phab:T236606|T236606]]
* 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] [[phab:T236606|T236606]]
* 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
* 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
* 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
* 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
* 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
* 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
* 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
* 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
* 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
* 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
* 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
* 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
* 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
* 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
* 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
* 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
* 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
* 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
* 00:50 bstorm_: rebuilt all docker images to include webservice 0.64
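Each "Joined tools-k8s-worker-NN to 2020 Kubernetes cluster" entry above boils down to a kubeadm join on the new worker. A sketch; the control-plane endpoint, token and hash are placeholders:
<syntaxhighlight lang="bash">
# On an existing control-plane node: print a fresh join command
sudo kubeadm token create --print-join-command

# On the new worker: run the printed command, which looks like
sudo kubeadm join <control-plane-endpoint>:6443 \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>

# Back on a control node: confirm the node registered and is Ready
kubectl get nodes
</syntaxhighlight>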


=== 2020-02-27 ===
=== 2022-10-14 ===
* 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
* 07:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0cc020e}}) ([[phab:T311466|T311466]]) - cookbook ran by taavi@runko
* 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
* 21:03 jeh: add reindex service account to elasticsearch for data migration [[phab:T236606|T236606]]
* 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
* 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 [[phab:T236606|T236606]]
* 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
* 18:20 bd808: Building tools-k8s-worker-[36-55]
* 17:56 bd808: Deleted instances tools-worker-10[21-40]
* 16:14 bd808: Decommissioning tools-worker-10[21-40]
* 16:02 bd808: Drained tools-worker-1021
* 15:51 bd808: Drained tools-worker-1022
* 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
* 15:39 bd808: Drained tools-worker-1025
* 15:39 bd808: Drained tools-worker-1026
* 15:11 bd808: Drained tools-worker-1027
* 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
* 15:07 bd808: Drained tools-worker-1030
* 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was over optimistic about repacking legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
* 15:00 bd808: Drained tools-worker-1031
* 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
* 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 14:41 bd808: Drained tools-worker-1032
* 14:37 bd808: Drained tools-worker-1033
* 14:35 bd808: Drained tools-worker-1034
* 14:34 bd808: Drained tools-worker-1035
* 14:33 bd808: Drained tools-worker-1036
* 14:33 bd808: Drained tools-worker-10<nowiki>{</nowiki>39,38,37<nowiki>}</nowiki> yesterday but did not !log
* 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
* 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
* 00:02 bd808: Rebooting tools-worker-1002
* 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems


=== 2020-02-26 ===
=== 2022-10-13 ===
* 23:42 bd808: Drained tools-worker-1040
* 15:10 arturo: restart jobs-emailer pod
* 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
* 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
* 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
* 21:06 bstorm_: deleting loads of stuck grid jobs
* 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
* 20:15 bstorm_: rebooting tools-sgegrid-master because it actually had the permissions thing going on still
* 18:03 bstorm_: downtimed toolschecker for nfs maintenance
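The 23:12 entry above replaced every tool's LimitRange with a lower CPU request. A sketch of that kind of bulk replacement; the namespace selector, object name and resource values here are purely illustrative, not the values actually deployed:
<syntaxhighlight lang="bash">
# Apply a LimitRange with a lower default CPU request to every tool namespace
for ns in $(kubectl get ns -o name | grep '^namespace/tool-' | cut -d/ -f2); do
  kubectl -n "$ns" apply -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: tool-limits          # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 150m              # illustrative values
      memory: 256Mi
    default:
      cpu: 500m
      memory: 512Mi
EOF
done
</syntaxhighlight>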


=== 2020-02-25 ===
=== 2022-10-12 ===
* 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`
* 23:25 bd808: Rebuilding all Toolforge docker images ([[phab:T278436|T278436]], [[phab:T311466|T311466]], [[phab:T293552|T293552]])
* 20:43 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. Third try seems to be working. ([[phab:T316554|T316554]])
* 20:31 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages after fixing bug in building the bullseye base image. ([[phab:T316554|T316554]])
* 16:26 dcaro: deploy the latest registry admission webhook, now for real (image tag {{Gerrit|07bc7db}})
* 12:48 dcaro: deploy the latest registry admission webhook (image tag {{Gerrit|07bc7db}})
* 09:26 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 09:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2020-02-23 ===
=== 2022-10-11 ===
* 00:40 Krenair: [[phab:T245932|T245932]]
* 13:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8574c36}}) - cookbook ran by taavi@runko
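The wm-bot2 "build & push docker image" entries throughout this log come from a cookbook; done by hand, the equivalent is roughly the following (repository, commit and tag from the 13:52 entry; a Dockerfile at the repo root and configured registry credentials are assumed):
<syntaxhighlight lang="bash">
git clone https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api
cd jobs-framework-api
git checkout 8574c36
docker build -t docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36 .
docker push docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36
</syntaxhighlight>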


=== 2020-02-21 ===
=== 2022-10-10 ===
* 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022
* 19:30 taavi: rebooting all k8s worker nodes to clean up labstore1006/7 remains
* 16:51 taavi: clean up labstore1006/7 mounts from k8s control nodes [[phab:T320425|T320425]]
* 11:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer rollout restart deployment/jobs-emailer ([[phab:T317998|T317998]])
* 08:44 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) ([[phab:T320284|T320284]]) - cookbook ran by taavi@runko
* 08:39 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) - cookbook ran by taavi@runko


=== 2020-02-20 ===
=== 2022-10-09 ===
* 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
* 17:29 taavi: kill 10 idle tmux sessions of user 'hoi' on tools-sgebastion-10 [[phab:T320352|T320352]]
* 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week [[phab:T245365|T245365]]


=== 2020-02-19 ===
=== 2022-10-07 ===
* 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 [[phab:T245365|T245365]]
* 13:02 taavi: taavi@cloudcontrol1005 ~ $ sudo mark_tool --disable oncall # [[phab:T320240|T320240]]
* 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid [[phab:T245365|T245365]]
* 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
* 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for [[phab:T245426|T245426]] (done several hours ago, but I forgot to !log it)


=== 2020-02-18 ===
=== 2022-10-06 ===
* 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
* 00:39 bd808: Image rebuild failing with debian apt repo signature issue. Will investigate tomorrow. ([[phab:T316554|T316554]])
* 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it
* 00:36 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. ([[phab:T316554|T316554]])
* 00:04 bd808: Building new php74-sssd-base & web images ([[phab:T310435|T310435]])


=== 2020-02-17 ===
=== 2022-10-03 ===
* 18:53 arturo: [[phab:T168677|T168677]] created DNS TXT record _psl.toolforge.org. with value `https://github.com/publicsuffix/list/pull/970`
* 14:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|8da432b}}) - cookbook ran by taavi@runko
* 13:22 arturo: relocating tools-sgewebgrid-lighttpd-0914 to cloudvirt1012 to spread same VMs across different hypervisors


=== 2020-02-14 ===
=== 2022-09-28 ===
* 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)
* 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])


=== 2020-02-13 ===
=== 2022-09-22 ===
* 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]
* 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> ([[phab:T244791|T244791]])
* 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> from grid engine config ([[phab:T244791|T244791]])
* 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022


=== 2020-02-12 ===
=== 2022-09-10 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) ([[phab:T244954|T244954]])
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko
* 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions ([[phab:T244954|T244954]])
* 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 ([[phab:T244791|T244791]])
* 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 ([[phab:T244791|T244791]])
* 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 ([[phab:T244791|T244791]])
* 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 ([[phab:T244791|T244791]])
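"Depooling" a grid node, as in the entries above, means stopping the scheduler from placing new jobs on it while existing jobs finish. In plain gridengine terms that is roughly the following sketch (the project's own wrapper scripts were normally used instead; host name is illustrative):
<syntaxhighlight lang="bash">
# Disable all queue instances on the host so no new jobs are scheduled there
sudo qmod -d '*@tools-sgewebgrid-lighttpd-0921'
# Watch the host's queues until running jobs drain away
qstat -f -u '*' -q '*@tools-sgewebgrid-lighttpd-0921'
# Repooling is the reverse
sudo qmod -e '*@tools-sgewebgrid-lighttpd-0921'
</syntaxhighlight>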


=== 2020-02-11 ===
=== 2022-09-07 ===
* 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 ([[phab:T244791|T244791]])
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])
* 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 ([[phab:T244791|T244791]])
* 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 ([[phab:T244791|T244791]])


=== 2020-02-10 ===
=== 2022-09-06 ===
* 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])
* 22:51 bstorm_: all docker images now use webservice 0.62
* 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 [[phab:T244293|T244293]] [[phab:T244289|T244289]] [[phab:T234617|T234617]] [[phab:T156626|T156626]]


=== 2020-02-07 ===
=== 2022-08-25 ===
* 10:55 arturo: drop jessie VM instances tools-prometheus-<nowiki>{</nowiki>01,02<nowiki>}</nowiki>, which were shut down ([[phab:T238096|T238096]])
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]


=== 2020-02-06 ===
=== 2022-08-24 ===
* 10:44 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/565556 which is a behavior change to the Toolforge front proxy ([[phab:T234617|T234617]])
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 10:27 arturo: shutdown again tools-prometheus-01, no longer in use ([[phab:T238096|T238096]])
* 12:20 taavi: upgrading ingress-nginx to v1.3
* 05:07 andrewbogott: cleared out old /tmp and /var/log files on tools-sgebastion-07


=== 2020-02-05 ===
=== 2022-08-20 ===
* 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) ([[phab:T238096|T238096]])
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])


=== 2020-02-04 ===
=== 2022-08-18 ===
* 11:38 arturo: start again tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs ([[phab:T238096|T238096]])
* 14:45 andrewbogott: adding lucaswerkmeister  as projectadmin ([[phab:T314527|T314527]])
* 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) [[phab:T238096|T238096]]
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair


=== 2020-02-03 ===
=== 2022-08-17 ===
* 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03, data synced ([[phab:T238096|T238096]])
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected
* 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-<nowiki>{</nowiki>03,04<nowiki>}</nowiki> ([[phab:T238096|T238096]])


=== 2020-01-31 ===
=== 2022-08-16 ===
* 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working ([[phab:T238096|T238096]])
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05
* 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0<nowiki>{</nowiki>3,4<nowiki>}</nowiki> due to some inconsistencies preventing prometheus from starting ([[phab:T238096|T238096]])


=== 2020-01-30 ===
=== 2022-08-11 ===
* 21:04 andrewbogott: also apt-get install python3-novaclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 20:39 andrewbogott: apt-get install python3-keystoneclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues
* 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 ([[phab:T238096|T238096]])
* 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 ([[phab:T238096|T238096]])
* 13:42 arturo: disable puppet in prometheus servers while syncing metric data ([[phab:T238096|T238096]])
* 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup is right now. Use a web proxy instead `tools-prometheus-new.wmflabs.org` ([[phab:T238096|T238096]])
* 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test [[phab:T238096|T238096]]
* 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 ([[phab:T238096|T238096]])
* 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup ([[phab:T238096|T238096]])
* 10:20 arturo: create new VM instance tools-prometheus-03 ([[phab:T238096|T238096]])
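The "syncing metric data" steps in this and the following days' entries amount to copying the Prometheus TSDB between the old and new hosts while the service is stopped. A sketch; only the prometheus@tools unit name is taken from the log, and the data directory path and root-ssh rsync are assumptions:
<syntaxhighlight lang="bash">
# On both hosts: stop the tools Prometheus instance so the TSDB is quiescent
sudo systemctl stop prometheus@tools

# On the new host: pull the data over from the old one
# (path is an assumption; use the instance's actual storage.tsdb.path)
sudo rsync -a --delete \
    tools-prometheus-01.tools.eqiad.wmflabs:/srv/prometheus/tools/metrics/ \
    /srv/prometheus/tools/metrics/

sudo systemctl start prometheus@tools
</syntaxhighlight>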


=== 2020-01-29 ===
=== 2022-08-05 ===
* 20:07 bd808: Created <nowiki>{</nowiki>bastion,login,dev<nowiki>}</nowiki>.toolforge.org service names for Toolforge bastions using Horizon & Designate
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
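The 20:07 entry above created the bastion/login/dev service names in DNS. With the OpenStack CLI against Designate that looks roughly like the following; the record type and target are assumptions, since the entry only says Horizon and Designate were used:
<syntaxhighlight lang="bash">
# One recordset per service name in the toolforge.org. zone; CNAME target is illustrative
for name in bastion login dev; do
  openstack recordset create toolforge.org. "$name" \
      --type CNAME \
      --record tools-sgebastion-07.tools.eqiad1.wikimedia.cloud.
done
</syntaxhighlight>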


=== 2020-01-28 ===
=== 2022-08-03 ===
* 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux {{!}} grep [t]ools.j {{!}} awk -F" " "<nowiki>{</nowiki>print \$2<nowiki>}</nowiki>") ; do  echo "killing $i" ; sudo kill $i ; done {{!}}{{!}} true'` ([[phab:T243831|T243831]])
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station
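Recreating the jobs-api pods so they pick up a changed ConfigMap (15:51 above) can be done with a rollout restart. The namespace and deployment names below are assumptions, modelled on the jobs-emailer restart logged elsewhere on this page:
<syntaxhighlight lang="bash">
# Roll the deployment so new pods mount the updated ConfigMap
kubectl -n jobs-api rollout restart deployment/jobs-api
kubectl -n jobs-api rollout status deployment/jobs-api
</syntaxhighlight>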


=== 2020-01-27 ===
=== 2022-07-20 ===
* 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. [[phab:T115231|T115231]]
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. [[phab:T115231|T115231]]
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2020-01-24 ===
=== 2022-07-19 ===
* 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubeusers :beta image as :latest


=== 2020-01-23 ===
=== 2022-07-17 ===
* 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 05:15 bd808: Building tools-elastic-04
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 04:39 bd808: wmcs-openstack quota set --instances 192
* 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000


=== 2020-01-22 ===
=== 2022-07-14 ===
* 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
* 13:48 taavi: rebooting tools-sgeexec-10-2
* 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)


=== 2020-01-21 ===
=== 2022-07-13 ===
* 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
* 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
* 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle


=== 2020-01-16 ===
=== 2022-07-11 ===
* 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon
* 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
* 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
* 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` [[phab:T242397|T242397]]


=== 2020-01-14 ===
=== 2022-07-07 ===
* 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 02:23 andrewbogott: rebooting tools-paws-worker-1006  to resolve hangs associated with an old NFS failure


=== 2020-01-13 ===
=== 2022-06-28 ===
* 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 ([[phab:T242642|T242642]])
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. [[phab:T242559|T242559]]
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]
* 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. [[phab:T242559|T242559]]
* 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. [[phab:T242559|T242559]]
* 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. [[phab:T242559|T242559]]


=== 2020-01-12 ===
=== 2022-06-27 ===
* 22:31 Krenair: same on -13 and -14
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 22:28 Krenair: same on -8
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 22:18 Krenair: same on -7
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]


=== 2020-01-11 ===
=== 2022-06-23 ===
* 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]


=== 2020-01-10 ===
=== 2022-06-22 ===
* 23:31 bstorm_: updated toollabs-webservice package to 0.56
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2020-01-09 ===
=== 2022-06-21 ===
* 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:06 bstorm_: rebooting tools-paws-master-01 [[phab:T242353|T242353]]
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 bstorm_: refreshing the paws cluster's entire x509 environment [[phab:T242353|T242353]]


=== 2020-01-07 ===
=== 2022-06-03 ===
* 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:33 arturo: deleted by hand pod metrics/cadvisor-5pd46 due to prometheus having issues scraping it
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster [[phab:T242067|T242067]]
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` ([[phab:T241853|T241853]])
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 ([[phab:T241853|T241853]])
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 ([[phab:T241853|T241853]])
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace ([[phab:T241853|T241853]])
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 05:02 bd808: Creating tools-k8s-worker-[6-14]
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to the g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to the g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]
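
The ''upload … copied from …'' image entries above follow the usual pull/tag/push mirroring pattern; a minimal sketch for the metrics-server image, using the same names as in the entries:
<syntaxhighlight lang="bash">
# Mirror an upstream image into the Toolforge registry (pull, retag, push).
docker pull k8s.gcr.io/metrics-server-amd64:v0.3.6
docker tag k8s.gcr.io/metrics-server-amd64:v0.3.6 \
    docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6
docker push docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6
</syntaxhighlight>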


=== 2020-01-06 ===
=== 2022-06-02 ===
* 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. The node is full of jobs that the grid master is not tracking, and it is failing to spawn new jobs sent by the scheduler
* 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the NFS volumes on tools-k8s-haproxy-1 [[phab:T241908|T241908]] (fstab sketch at the end of this section)
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 [[phab:T241908|T241908]]
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix [[phab:T241908|T241908]]
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 16:42 bstorm_: failed sge-shadow-master back to the main grid master
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
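
A rough sketch of the fstab/NFS removal on the haproxy nodes mentioned above, assuming the standard Toolforge NFS mount points (the exact fstab entries and paths are assumptions):
<syntaxhighlight lang="bash">
# Remove the NFS entries from fstab (keeping a backup) and drop the mounts.
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i '/\bnfs\b/d' /etc/fstab                       # delete lines whose fstype is nfs
sudo umount -l /data/project /home 2>/dev/null || true    # lazily unmount if still mounted
mount | grep -i nfs || echo "no NFS mounts remain"
</syntaxhighlight>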


=== 2020-01-04 ===
=== 2022-06-01 ===
* 18:11 bd808: Shutdown tools-worker-1029
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]
* 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
* 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
* 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:16 bd808: Draining tools-worker-10<nowiki>{</nowiki>05,12,28<nowiki>}</nowiki> due to hardware errors ([[phab:T241884|T241884]]) (drain/uncordon sketched at the end of this section)
* 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241873|T241873]])
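
The drain step for the legacy workers above is roughly the following (a sketch; the flags are the ones commonly used with that era of kubectl and should be treated as an assumption):
<syntaxhighlight lang="bash">
# Drain a worker before migrating its VM, then return it to service afterwards.
kubectl drain tools-worker-1005.tools.eqiad.wmflabs --ignore-daemonsets --delete-local-data
# ... migrate/reboot the VM ...
kubectl uncordon tools-worker-1005.tools.eqiad.wmflabs
</syntaxhighlight>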


=== 2020-01-03 ===
=== 2022-05-31 ===
* 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation
* 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 ([[phab:T237643|T237643]])
* 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for [[phab:T237643|T237643]]
* 03:04 bd808: Really rebuilding all <nowiki>{</nowiki>jessie,stretch,buster<nowiki>}</nowiki>-sssd images. Last time I forgot to actually update the git clone.
* 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox


=== 2020-01-02 ===
=== 2022-05-30 ===
* 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2019-12-30 ===
=== 2022-05-26 ===
* 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for [[phab:T241523|T241523]]
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko
* 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full


=== 2019-12-29 ===
=== 2022-05-22 ===
* 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - [[phab:T241523|T241523]]
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2019-12-27 ===
=== 2022-05-16 ===
* 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko


=== 2019-12-25 ===
=== 2022-05-14 ===
* 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07 (see the sketch at the end of this section)
* 10:47 taavi: hard reboot of unresponsive tools-sgeexec-0940
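
A hedged sketch of the process cleanup behind the pkill entry above: list the matching processes first, then kill everything owned by the tool account that matches the command line.
<syntaxhighlight lang="bash">
# Check what would be affected, then terminate the runaway processes.
pgrep -u tools.kaleem-bot -af 'python pwb.py'      # -a shows the full command line
sudo pkill -u tools.kaleem-bot -f 'python pwb.py'  # sends SIGTERM; escalate only if needed
</syntaxhighlight>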


=== 2019-12-22 ===
=== 2022-05-12 ===
* 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test ([[phab:T241310|T241310]])
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change ([[phab:T241310|T241310]])
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko


=== 2019-12-20 ===
=== 2022-05-10 ===
* 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2019-12-18 ===
=== 2022-05-06 ===
* 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])
* 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.


=== 2019-12-17 ===
=== 2022-05-05 ===
* 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]
* 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster [[phab:T234037|T234037]]
* 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster [[phab:T214513|T214513]] [[phab:T228499|T228499]]
* 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
* 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster [[phab:T214513|T214513]]
* 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster [[phab:T214513|T214513]] (more successfully this time)
* 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs [[phab:T214513|T214513]] (chattr sketch at the end of this section)
* 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit [[phab:T214513|T214513]]
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
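
A minimal sketch of clearing the immutable flag that blocked the first maintain-kubeusers run, assuming tool kubeconfigs live under /data/project/&lt;tool&gt;/.kube/config (the glob is an assumption about the layout):
<syntaxhighlight lang="bash">
# Clear the immutable attribute so maintain-kubeusers can rewrite the kubeconfigs.
sudo chattr -i /data/project/*/.kube/config
lsattr /data/project/*/.kube/config | head    # spot-check that the 'i' flag is gone
</syntaxhighlight>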


=== 2019-12-16 ===
=== 2022-05-03 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02 (a CLI equivalent is sketched below)
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
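
The SMTP rule above can be expressed with the OpenStack CLI roughly as follows (a sketch of an equivalent command; the change may well have been made through Horizon instead):
<syntaxhighlight lang="bash">
# Allow inbound SMTP to instances in the MTA security group.
openstack security group rule create MTA \
    --ingress --protocol tcp --dst-port 25 --remote-ip 0.0.0.0/0
</syntaxhighlight>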


=== 2019-12-14 ===
=== 2022-05-02 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2019-12-13 ===
=== 2022-04-25 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 14:46 bd808: Building toolforge-webservice v0.82
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream


=== 2019-12-12 ===
=== 2022-04-23 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01 as it wasn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues


=== 2019-12-11 ===
=== 2022-04-20 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2019-12-10 ===
=== 2022-04-16 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko


=== 2019-12-09 ===
=== 2022-04-12 ===
* 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test ([[phab:T91619|T91619]])
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])


=== 2019-12-04 ===
=== 2022-04-10 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)


=== 2019-11-29 ===
=== 2022-04-09 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)


=== 2019-11-26 ===
=== 2022-04-08 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config


=== 2019-11-25 ===
=== 2022-04-05 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7


=== 2019-11-22 ===
=== 2022-04-04 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 05:55 jeh: add Riley Huntley `riley` to base tools project
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2019-11-21 ===
=== 2022-03-28 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])


=== 2019-11-19 ===
=== 2022-03-15 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...); see the qmod sketch at the end of this section
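
Clearing a gridengine queue error state, as in the entry above, is typically done from the grid master along these lines (a sketch; option spellings vary slightly between gridengine versions):
<syntaxhighlight lang="bash">
# Clear the error state on one queue instance, then confirm nothing is left in state E.
sudo qmod -c 'continuous@tools-sgeexec-0939.tools.eqiad.wmflabs'
qstat -f -qs E    # list queue instances still in an error state (should be empty)
</syntaxhighlight>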


=== 2019-11-15 ===
=== 2022-03-14 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4, to the local repo ([[phab:T297090|T297090]])


=== 2019-11-13 ===
=== 2022-03-10 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2019-11-12 ===
=== 2022-03-01 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled due to queue errors; rebooted them and cleaned the errors by hand


=== 2019-11-10 ===
=== 2022-02-28 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 08:02 taavi: reboot sgeexec-0916
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]


=== 2019-11-08 ===
=== 2022-02-17 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 08:23 taavi: deleted tools-clushmaster-02
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"


=== 2019-11-07 ===
=== 2022-02-16 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 00:12 bd808: Image builds completed.
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]


=== 2019-11-06 ===
=== 2022-02-15 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 22:50 bd808: Built new toollabs-webservice 0.81
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 13:50 arturo: created 3 VMs`tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 18:21 taavi: delete tools-package-builder-03
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2019-11-05 ===
=== 2022-02-10 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]] (rebase sketch at the end of this section)
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]
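
Dropping stale cherry-picks from the local labs/private clone (first entry in this section) is usually an interactive rebase; a hedged sketch, with the repository path taken from the entry and the upstream branch name assumed:
<syntaxhighlight lang="bash">
# Review the local cherry-picks, then rebase and delete the unwanted ones.
cd /var/lib/git/labs/private
sudo git log --oneline origin/master..HEAD   # shows the remaining local commits
sudo git rebase -i origin/master             # remove the lines for 59a77a3, 3830802, 83df61f
</syntaxhighlight>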


=== 2019-11-04 ===
=== 2022-02-09 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 18:25 arturo: ignore last message
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2019-11-01 ===
=== 2022-02-07 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]


=== 2019-10-31 ===
=== 2022-02-04 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001.  Runaway logfiles filled up the drive which prevented puppet from running.  If puppet had run, it would have prevented the runaway logfiles.
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 21:36 taavi: clear error state from some webgrid nodes
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])


=== 2019-10-30 ===
=== 2022-02-03 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]


=== 2019-10-29 ===
=== 2022-01-30 ===
* 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]


=== 2019-10-28 ===
=== 2022-01-26 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 15:54 arturo: tools-proxy-05 now has the 185.15.56.11 floating IP as the active proxy; the old one, 185.15.56.6, has been freed [[phab:T235627|T235627]] (CLI sketch at the end of this section)
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 15:16 arturo: tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy [[phab:T235627|T235627]]
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])
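
The proxy failover steps in this section (attaching the webproxy security group and moving the floating IP to tools-proxy-05) have straightforward OpenStack CLI equivalents; a sketch using the names and addresses from the entries above, though the changes may have been made via Horizon:
<syntaxhighlight lang="bash">
# Attach the security group and point the floating IP at the new active proxy.
openstack server add security group tools-proxy-05 webproxy
openstack server add floating ip tools-proxy-05 185.15.56.11
</syntaxhighlight>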


=== 2019-10-24 ===
=== 2022-01-25 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4


=== 2019-10-23 ===
=== 2022-01-24 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting


=== 2019-10-22 ===
=== 2022-01-20 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2019-10-21 ===
=== 2022-01-19 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move


=== 2019-10-18 ===
=== 2022-01-14 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])


=== 2019-10-16 ===
=== 2022-01-12 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 09:29 arturo: toolforge has recovered from the reboot of cloudvirt1029
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 09:17 arturo: due to the reboot of cloudvirt1029, several nodes are offline: 8 sgeexec, 8 sgewebgrid-lighttpd, 3 tools-worker, and the main toolforge proxy (tools-proxy-03)
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2019-10-15 ===
=== 2022-01-04 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
 
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]
=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])
 
=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]] (aptly workflow sketched at the end of this section)
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
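
Adding a .deb to the project aptly repos, as in the entries above, roughly follows this pattern on the services host (a sketch; repo and publish names follow the entries, and the signing flag is an assumption):
<syntaxhighlight lang="bash">
# Add the package to both repos and republish them.
aptly repo add buster-tools kubernetes-node_1.4.6-7_amd64.deb
aptly repo add buster-toolsbeta kubernetes-node_1.4.6-7_amd64.deb
aptly publish update --skip-signing buster-tools
aptly publish update --skip-signing buster-toolsbeta
</syntaxhighlight>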
 
=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.
 
=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting
 
=== 2019-10-08 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.
 
=== 2019-10-07 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue (a quick staleness check is sketched at the end of this section)
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
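
Most of the reboots in this section were for stale NFS handles; a quick hedged check worth running before each reboot (the paths are the usual Toolforge NFS mounts and are an assumption):
<syntaxhighlight lang="bash">
# A hung stat (or "Stale file handle" errors in dmesg) means the client is stuck.
timeout 10 stat -f /data/project || echo "NFS mount looks hung or stale"
dmesg | tail -n 50 | grep -iE 'stale|nfs' || true
</syntaxhighlight>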
 
=== 2019-10-04 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated
 
=== 2019-10-03 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required
 
=== 2019-09-27 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927
 
=== 2019-09-25 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021
 
=== 2019-09-23 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])
 
=== 2019-09-12 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use
 
=== 2019-09-11 ===
* 13:30 jeh: restart tools-sgeexec-0912
 
=== 2019-09-09 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038
 
=== 2019-09-06 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])
 
=== 2019-09-05 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
 
=== 2019-09-01 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01
 
=== 2019-08-30 ===
* 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts
 
=== 2019-08-29 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use
 
=== 2019-08-27 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
 
=== 2019-08-26 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
 
=== 2019-08-18 ===
* 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01
 
=== 2019-08-17 ===
* 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck
 
=== 2019-08-15 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces
 
=== 2019-08-13 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 13:41 jeh: Set icinga downtime for toolschecker labs showmount [[phab:T229448|T229448]]
 
=== 2019-08-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
 
=== 2019-08-08 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
 
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job {{Gerrit|5190035}}
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled  tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to clean up sssd config and let nslcd/nscd start freshly
* 12:47 arturo: reboot tools-worker-1002 to clean up sssd config and let nslcd/nscd start freshly
* 12:42 arturo: reboot tools-worker-1001 to clean up sssd config and let nslcd/nscd start freshly
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as the busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
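The depool / reallocate / repool entries in this section all follow the same pattern. A minimal sketch, assuming the exec-manage helper used elsewhere in this log and that the actual instance migration between cloudvirts happens separately via Horizon/OpenStack:
<syntaxhighlight lang="bash">
node=tools-sgeexec-0901.tools.eqiad.wmflabs   # example node from the entries above

# Stop scheduling new grid jobs on the node and let running jobs finish
sudo exec-manage depool "$node"

# (migrate the instance to the target cloudvirt via Horizon, not shown here)

# Put the node back into service once it is reachable again
sudo exec-manage repool "$node"
</syntaxhighlight>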
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should now be fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tool-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
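One way to reschedule the pods of a node, assuming kubectl access on the Kubernetes master (a sketch, not necessarily the exact commands used here):
<syntaxhighlight lang="bash">
# Cordon the node and evict its pods so they get rescheduled elsewhere
kubectl drain tools-worker-1023.tools.eqiad.wmflabs --ignore-daemonsets

# Allow scheduling on the node again once it is healthy
kubectl uncordon tools-worker-1023.tools.eqiad.wmflabs
</syntaxhighlight>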
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] so the k8s node moves would register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
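The recurring "cleared E state" entries refer to grid engine queue instances stuck in error state. A sketch of the usual fix on the grid master, assuming Son of Grid Engine's qstat/qmod (the queue instance name is only an example; older releases accept qmod -c instead of -cq):
<syntaxhighlight lang="bash">
# Show queue instances in error state together with the reason
qstat -f -explain E

# Clear the error state of an affected queue instance
sudo qmod -cq 'task@tools-sgeexec-0909.tools.eqiad.wmflabs'
</syntaxhighlight>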
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
* 20:03 bstorm_: depooled  tools-webgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-update, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up now
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to tool's $HOME/crontab.trusty.save
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01{{!}}07{{!}}10)` all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.com is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg' and 'dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-bastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300sec
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns the correct results
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
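A condensed sketch of the diagnosis and fix described in this section (the stray socket path is the one named in the 04:33 and 04:30 entries):
<syntaxhighlight lang="bash">
# Symptom: LDAP lookups fail and nslcd cannot bind its socket because
# something created a directory where the unix socket should live
ls -ld /var/run/nslcd/socket

# Remove the stray directory and start the daemon again
sudo rmdir /var/run/nslcd/socket
sudo systemctl start nslcd

# Verify that LDAP lookups work again
id zhuyifei1999
</syntaxhighlight>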
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools`, which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
 
=== 2019-01-14 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
 
=== 2019-01-11 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
 
=== 2019-01-10 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
 
=== 2019-01-09 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
 
=== 2019-01-07 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
 
=== 2019-01-06 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
 
=== 2019-01-05 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
 
=== 2019-01-04 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
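A sketch of that archive-then-truncate step (the archive location and naming are assumptions; truncate -s 0 matches how other log files are emptied elsewhere in this log):
<syntaxhighlight lang="bash">
acct=/data/project/.system/accounting

# Keep a compressed copy of the recent history before emptying the live file
sudo cp "$acct" "/root/accounting.$(date +%Y%m%d)"
sudo gzip "/root/accounting.$(date +%Y%m%d)"

# Empty the live accounting file so it stops growing without bound
sudo truncate -s 0 "$acct"
</syntaxhighlight>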
 
=== 2019-01-03 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01
 
=== 2018-12-21 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]
 
=== 2018-12-20 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002
 
=== 2018-12-17 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
 
=== 2018-12-11 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
 
=== 2018-12-05 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
 
=== 2018-12-04 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])
 
=== 2018-12-01 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
 
=== 2018-11-30 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])
 
=== 2018-11-27 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
 
=== 2018-11-26 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])
 
=== 2018-11-20 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 09:43 godog: restart prometheus@tools on prometheus-01
 
=== 2018-11-16 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 17:47 gtirloni: deleted tools-mail instance
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
 
=== 2018-11-14 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009
 
=== 2018-11-13 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
 
=== 2018-11-08 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])
 
=== 2018-11-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
 
=== 2018-11-02 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
 
=== 2018-10-31 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
 
=== 2018-10-29 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
 
=== 2018-10-26 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
 
=== 2018-10-19 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
 
=== 2018-10-18 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
 
=== 2018-10-16 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
 
=== 2018-10-07 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be stuck in a loop, iterating every 10 seconds. installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
 
=== 2018-09-21 ===
* 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines
 
=== 2018-09-17 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
 
=== 2018-09-14 ===
* 14:20 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API
 
=== 2018-09-08 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
 
=== 2018-09-07 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
 
=== 2018-08-27 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
 
=== 2018-08-22 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
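Written out as a plain shell pipeline, the queue cleanup above (and the shorter variant in the 2018-09-17 entry) boils down to:
<syntaxhighlight lang="bash">
# Select every message id in the exim queue and remove it from the queue
sudo exiqgrep -i | xargs sudo exim -Mrm
</syntaxhighlight>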
 
=== 2018-08-19 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
 
=== 2018-08-14 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
 
=== 2018-08-13 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice
 
=== 2018-08-09 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
 
=== 2018-08-08 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
 
=== 2018-08-06 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
 
=== 2018-08-01 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
 
=== 2018-07-30 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
 
=== 2018-07-27 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
 
=== 2018-07-25 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
 
=== 2018-07-18 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]
 
=== 2018-06-30 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 16:40 zhuyifei1999_: because tools-paws-master-01 had a load average of ~1000 due to NFS issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
 
=== 2018-06-29 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool-owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]
 
=== 2018-06-28 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
 
=== 2018-06-21 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-20 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
 
=== 2018-06-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-11 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
 
=== 2018-06-08 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today; restarting `prometheus-node-exporter` again and force-rotating the exim4 paniclog on 12 nodes
 
=== 2018-06-07 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim paniclog files to avoid rootspam: `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
 
=== 2018-06-06 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after the 2nd attempt ([[phab:T196589|T196589]])
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after the last attempt (P7220)
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220); see the restart sketch below
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- processes were using the old uid
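
A rough sketch of how a bulk restart like the ones above could be scripted; the `kubectl` filtering and the namespace-to-tool mapping are assumptions about the approach, not the contents of the pasted script (P7220):
<syntaxhighlight lang="bash">
# Assumption: each tool's pods run in a namespace named after the tool, so
# the namespace maps directly to a tools.<name> account.
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1}' \
  | sort -u \
  | while read -r tool; do
      echo "restarting webservice for ${tool}"
      sudo -i -u "tools.${tool}" webservice restart
    done
</syntaxhighlight>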
 
=== 2018-06-05 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben ([[phab:T196486|T196486]])
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])
 
=== 2018-06-04 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
 
=== 2018-06-03 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' [[phab:T195834|T195834]]
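
A hedged sketch of how that selective cleanup could be done with standard gridengine commands; the `qstat`/`awk` filtering is illustrative, not the exact invocation used:
<syntaxhighlight lang="bash">
# Assumption: qstat prints a two-line header and the (truncated) job name in
# its third column.

# Delete every tools.dibot job except its lighttpd webservice job.
qstat -u tools.dibot | awk 'NR > 2 && $3 !~ /lighttpd/ {print $1}' | xargs -r sudo qdel

# Delete tools.mbh jobs whose names start with comm_delin or delfilexcl.
qstat -u tools.mbh | awk 'NR > 2 && ($3 ~ /^comm_delin/ || $3 ~ /^delfilexcl/) {print $1}' \
  | xargs -r sudo qdel
</syntaxhighlight>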
 
=== 2018-05-31 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
 
=== 2018-05-30 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; it was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall it once we can close [[phab:T195834|T195834]]
 
=== 2018-05-28 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
 
=== 2018-05-25 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
 
=== 2018-05-22 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
 
=== 2018-05-18 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
 
=== 2018-05-16 ===
* 21:17 zhuyifei1999_: maintain-kubeusers stuck in an infinite loop of 10-second sleeps
 
=== 2018-05-15 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
 
=== 2018-05-12 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
 
=== 2018-05-11 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
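
A minimal sketch of how a batch depool like this could be run, assuming the `exec-manage` helper used elsewhere in this log; repooling afterwards would use `exec-manage repool` the same way:
<syntaxhighlight lang="bash">
# Depool each grid node ahead of the labvirt1001 reboot (T194258); the
# hostnames are copied from the log entry above, the loop itself is
# illustrative.
for host in tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 \
            tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 \
            tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 \
            tools-webgrid-lighttpd-1407; do
  sudo exec-manage depool "${host}.eqiad.wmflabs"
done
</syntaxhighlight>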
 
=== 2018-05-10 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
 
=== 2018-05-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
 
=== 2018-05-07 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02; it had been hogging NFS I/O for a few hours
 
=== 2018-05-05 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
 
=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
 
=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Killed orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`