Nova Resource:Tools/SAL: Difference between revisions

imported>Stashbot
(gtirloni: published new docker images (all))
imported>Stashbot
(bd808: Rebuilding container images (T337897))
 
(619 intermediate revisions by 2 users not shown)
=== 2019-01-22 ===
=== 2023-06-08 ===
* 20:21 gtirloni: published new docker images (all)
* 20:21 bd808: Rebuilding container images ([[phab:T337897|T337897]])
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
* 14:16 dcaro: restart tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud due to nfs hiccup
* 14:07 dcaro: restarting the tools-sgeexec-10-17 node due to nfs hiccup
* 14:00 dcaro: restarting the tools-sgegrid-master node due to nfs hiccup
* 12:00 dcaro: powering off tools-k8s-etcd-18 ([[phab:T334644|T334644]])
* 07:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|24e7828}}) - cookbook ran by taavi@runko


=== 2019-01-21 ===
=== 2023-06-07 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
* 12:45 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:a5eb7dc from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|a5eb7dc}}) - cookbook ran by taavi@runko


=== 2019-01-18 ===
=== 2023-06-05 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
* 07:53 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2019-01-17 ===
=== 2023-06-01 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 10:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|7e57832}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 09:21 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|0f4076a}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 09:18 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T336130|T336130]]) - cookbook ran by dcaro@vulcanus
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 07:52 dcaro: rebooted tools-package-builder-04 (stuck not letting me log in with my user)
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04


=== 2019-01-16 ===
=== 2023-05-31 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 02:38 andrewbogott: rebooted tools-sgeweblight-10-16 ([[phab:T337806|T337806]])
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04


=== 2019-01-15 ===
=== 2023-05-30 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 00:22 andrewbogott: rebooted tools-sgeweblight-10-30, oom
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 00:16 andrewbogott: rebooted tools-sgeweblight-10-24, seems to be oom
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`


=== 2019-01-14 ===
=== 2023-05-26 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 13:13 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/buildpack-admission-controller ({{Gerrit|ef7f103}}) ([[phab:T337218|T337218]]) - cookbook ran by dcaro@vulcanus
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 12:59 dcaro: rebooting tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud for stale NFS handles (D processes)
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`


=== 2019-01-11 ===
=== 2023-05-24 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 12:28 dcaro: deploy latest buildservice ([[phab:T335865|T335865]])
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 12:28 dcaro: deploy latest buildservice ([[phab:T336050|T336050]])
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1


=== 2019-01-10 ===
=== 2023-05-23 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 14:40 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|0c7b25b}}) - cookbook ran by fran@wmf3169
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid


=== 2019-01-09 ===
=== 2023-05-22 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 10:06 arturo: hard-reboot tools-sgeexec-10-18 (monitoring reporting it as down)
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])


=== 2019-01-07 ===
=== 2023-05-19 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 13:38 arturo: uncordon tools-k8s-worker-47/48/64/75
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 08:46 bd808: Building new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images ([[phab:T323522|T323522]], [[phab:T320904|T320904]])
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.


=== 2019-01-06 ===
=== 2023-05-17 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
* 16:05 dcaro: release toolforge-cli 0.3.0 ([[phab:T336225|T336225]])
* 12:48 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|fa8ed2c}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:48 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 12:45 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|d1bb238}}) ([[phab:T336225|T336225]]) - cookbook ran by dcaro@vulcanus
* 12:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api ({{Gerrit|8d21314}}) - cookbook ran by dcaro@vulcanus
* 10:54 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:7199a9e from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by fran@wmf3169
* 08:49 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:33 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:32 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:25 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:17 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:10 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 08:03 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:54 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:46 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:45 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:42 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 07:29 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus


=== 2019-01-05 ===
=== 2023-05-16 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
* 23:24 bd808: kubectl uncordon tools-k8s-worker-69
* 23:22 bd808: Force reboot tools-k8s-worker-69 via Horizon
* 23:18 bd808: kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-69
* 23:17 bd808: kubectl cordon tools-k8s-worker-69
* 14:37 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:35b57c6 from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|35b57c6}}) - cookbook ran by dcaro@vulcanus
* 13:05 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|df52a39}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 12:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ad5b2b5}}) ([[phab:T334081|T334081]]) - cookbook ran by dcaro@vulcanus
* 11:52 dcaro: release toolforge-weld 0.2.0 and toolforge-webservice 0.98
* 08:08 dcaro: reboot tools-mail-03 ([[phab:T316544|T316544]])
* 08:07 dcaro: reboot tools-sgebastion-10 ([[phab:T316544|T316544]])
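
The 23:17 to 23:24 entries for tools-k8s-worker-69 above follow a cordon, drain, reboot, uncordon order. A minimal sketch of that sequence is given below, assuming the reboot itself happens out of band (for example via Horizon, as in the log). The function names and the `KUBECTL` override are hypothetical and only there so the commands can be previewed without a cluster; the kubectl flags are copied from the log entry.

```shell
#!/bin/sh
# Sketch only, not the Toolforge cookbook.
# KUBECTL is a hypothetical override; set it to "echo kubectl" to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Cordon the node, then evict its pods (flags as used in the 23:18 log entry).
drain_node() {
    node="$1"
    # Stop new pods from being scheduled on the node.
    $KUBECTL cordon "$node" || return 1
    # Evict running pods, leaving daemonset pods in place.
    $KUBECTL drain --ignore-daemonsets --delete-emptydir-data --force "$node" || return 1
}

# Make the node schedulable again once it is back up after the reboot.
restore_node() {
    node="$1"
    $KUBECTL uncordon "$node"
}
```

With `KUBECTL="echo kubectl"` set, calling `drain_node tools-k8s-worker-69` followed by `restore_node tools-k8s-worker-69` only prints the three kubectl commands, which makes it possible to check the order before running it for real.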


=== 2019-01-04 ===
=== 2023-05-15 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
* 22:50 bd808: Rebuilding bullseye and buster docker containers to pick up make package addition ([[phab:T320343|T320343]])
* 22:09 wm-bot2: rebooted k8s node tools-k8s-worker-66 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:07 wm-bot2: rebooted k8s node tools-k8s-worker-65 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:06 wm-bot2: rebooted k8s node tools-k8s-worker-64 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:04 wm-bot2: rebooted k8s node tools-k8s-worker-62 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 22:02 wm-bot2: rebooted k8s node tools-k8s-worker-61 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:58 wm-bot2: rebooted k8s node tools-k8s-worker-60 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:56 wm-bot2: rebooted k8s node tools-k8s-worker-59 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:54 wm-bot2: rebooted k8s node tools-k8s-worker-58 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:52 wm-bot2: rebooted k8s node tools-k8s-worker-57 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:51 wm-bot2: rebooted k8s node tools-k8s-worker-56 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:50 wm-bot2: rebooted k8s node tools-k8s-worker-55 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:49 wm-bot2: rebooted k8s node tools-k8s-worker-54 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:47 wm-bot2: rebooted k8s node tools-k8s-worker-53 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:44 wm-bot2: rebooted k8s node tools-k8s-worker-52 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:42 wm-bot2: rebooted k8s node tools-k8s-worker-51 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:41 wm-bot2: rebooted k8s node tools-k8s-worker-50 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:40 wm-bot2: rebooted k8s node tools-k8s-worker-49 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:38 wm-bot2: rebooted k8s node tools-k8s-worker-48 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:37 wm-bot2: rebooted k8s node tools-k8s-worker-47 ([[phab:T316544|T316544]]) - cookbook ran by andrew@bullseye
* 21:33 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by andrew@bullseye
* 21:16 wm-bot2: rebooted k8s node tools-k8s-worker-45 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:15 wm-bot2: rebooted k8s node tools-k8s-worker-44 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:13 wm-bot2: rebooted k8s node tools-k8s-worker-43 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:12 wm-bot2: rebooted k8s node tools-k8s-worker-42 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:09 wm-bot2: rebooted k8s node tools-k8s-worker-41 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 21:03 wm-bot2: rebooted k8s node tools-k8s-worker-40 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:52 wm-bot2: rebooted k8s node tools-k8s-worker-38 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:50 wm-bot2: rebooted k8s node tools-k8s-worker-37 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:49 wm-bot2: rebooted k8s node tools-k8s-worker-36 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:48 wm-bot2: rebooted k8s node tools-k8s-worker-35 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:47 wm-bot2: rebooted k8s node tools-k8s-worker-34 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:42 wm-bot2: rebooted k8s node tools-k8s-worker-33 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:41 andrewbogott: rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
* 20:36 wm-bot2: rebooted k8s node tools-k8s-worker-32 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:32 wm-bot2: rebooted k8s node tools-k8s-worker-31 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 20:24 wm-bot2: rebooted k8s node tools-k8s-worker-30 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 19:04 wm-bot2: rebooted k8s node tools-k8s-worker-67 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:56 wm-bot2: rebooted k8s node tools-k8s-worker-68 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:49 wm-bot2: rebooted k8s node tools-k8s-worker-69 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:46 bd808: Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
* 18:44 wm-bot2: rebooted k8s node tools-k8s-worker-70 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:42 wm-bot2: rebooted k8s node tools-k8s-worker-71 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:39 wm-bot2: rebooted k8s node tools-k8s-worker-72 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:34 wm-bot2: rebooted k8s node tools-k8s-worker-73 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:28 wm-bot2: rebooted k8s node tools-k8s-worker-74 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 wm-bot2: rebooted k8s node tools-k8s-worker-75 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:22 taavi: clear mail queue
* 18:21 wm-bot2: rebooted k8s node tools-k8s-worker-76 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:15 wm-bot2: rebooted k8s node tools-k8s-worker-77 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:08 wm-bot2: rebooted k8s node tools-k8s-worker-80 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:06 wm-bot2: rebooted k8s node tools-k8s-worker-81 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 18:05 wm-bot2: rebooted k8s node tools-k8s-worker-82 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:57 wm-bot2: rebooted k8s node tools-k8s-worker-83 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:48 wm-bot2: rebooted k8s node tools-k8s-worker-84 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:47 wm-bot2: rebooted k8s node tools-k8s-worker-85 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:38 wm-bot2: rebooted k8s node tools-k8s-worker-86 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:37 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:35 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:34 wm-bot2: rebooting all the workers of tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:20 wm-bot2: rebooted k8s node tools-k8s-worker-87 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:19 wm-bot2: rebooted k8s node tools-k8s-worker-88 ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:17 bd808: Rebuilding bullseye and buster docker containers to pick up openssh-client package addition ([[phab:T258841|T258841]])
* 17:12 wm-bot2: rebooting the whole tools k8s cluster (64 nodes) ([[phab:T316544|T316544]]) - cookbook ran by dcaro@vulcanus
* 17:06 dcaro: rebooting tools-sgegrid-shadow ([[phab:T316544|T316544]])
* 17:00 dcaro: rebooting tools-sgegrid-master ([[phab:T316544|T316544]])
* 16:55 dcaro: rebooting tools-sgeexec-10-20 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-18 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-25 ([[phab:T316544|T316544]])
* 16:53 dcaro: rebooting tools-sgeweblight-10-20 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeweblight-10-21 ([[phab:T316544|T316544]])
* 16:52 dcaro: rebooting tools-sgeexec-10-22 ([[phab:T316544|T316544]])
* 16:51 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T316544|T316544]])
* 16:50 dcaro: rebooting tools-sgeexec-10-17 ([[phab:T316544|T316544]])
* 16:48 dcaro: rebooting tools-sgeexec-10-21 ([[phab:T316544|T316544]])
* 16:47 dcaro: rebooting tools-sgeexec-10-19 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeexec-10-8 ([[phab:T316544|T316544]])
* 16:45 dcaro: rebooting tools-sgeweblight-10-24 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgewebgen-10-2 ([[phab:T316544|T316544]])
* 16:44 dcaro: rebooting tools-sgeweblight-10-16 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeweblight-10-30 ([[phab:T316544|T316544]])
* 16:43 dcaro: rebooting tools-sgeexec-10-18 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-16 ([[phab:T316544|T316544]])
* 16:42 dcaro: rebooting tools-sgeexec-10-14 ([[phab:T316544|T316544]])
* 16:41 dcaro: rebooting tools-sgeweblight-10-32 ([[phab:T316544|T316544]])
* 16:40 dcaro: rebooting tools-sgeweblight-10-22 ([[phab:T316544|T316544]])
* 16:39 dcaro: rebooting tools-sgeweblight-10-17 ([[phab:T316544|T316544]])
* 16:32 dcaro: rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T316544|T316544]])
* 16:23 dcaro: rebooting tools-sgeweblight-10-26 ([[phab:T316544|T316544]])
* 16:15 bd808: Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
* 16:14 arturo: rebooted a bunch of nodes to cleanup D procs and high load avg because NFS outage (result of [[phab:T316544|T316544]])
* 12:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:09f3b49-dev from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git ({{Gerrit|32a8ae9}}) - cookbook ran by dcaro@vulcanus
* 09:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:c64da5a from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|c64da5a}}) - cookbook ran by dcaro@vulcanus


=== 2019-01-03 ===
=== 2023-05-13 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 09:13 taavi: reboot tools-sgeexec-10-15,17,18,21
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01


=== 2018-12-21 ===
=== 2023-05-11 ===
* 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
* 15:48 bd808: Rebooted tools-sgebastion-10 for [[phab:T336510|T336510]]
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 15:31 bd808: Sent `wall` for reboot of tools-sgebastion-10 circa 15:40Z
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]


=== 2018-12-20 ===
=== 2023-05-09 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 16:36 taavi: delegated beta.toolforge.org domain to toolsbeta per [[phab:T257386|T257386]]
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 09:35 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|ad4fa2a}}) - cookbook ran by taavi@runko
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002


=== 2018-12-17 ===
=== 2023-05-08 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 09:12 arturo: force-reboot tools-sgeexec-10-13 (reported as down by the monitoring, no SSH)
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`


=== 2018-12-11 ===
=== 2023-05-07 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
* 16:06 taavi: remove inbound 25/tcp rule from the toolserver legacy server [[phab:T136225|T136225]]


=== 2018-12-05 ===
=== 2023-05-05 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
* 22:21 bd808: Added "RepoLookoutBot" to hiera key "dynamicproxy::blocked_user_agent_regex" to stop unnecessary scans by https://www.repo-lookout.org/
* 22:20 bd808: Added
* 11:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:811164e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|811164e}}) - cookbook ran by taavi@runko
* 09:13 dcaro: rebooted tools-sgeexec-10-16 as it was stuck ([[phab:T335009|T335009]])


=== 2018-12-04 ===
=== 2023-05-04 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 15:15 wm-bot2: removed instance tools-k8s-etcd-15 - cookbook ran by andrew@bullseye
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])
* 14:13 wm-bot2: removed instance tools-k8s-etcd-14 - cookbook ran by andrew@bullseye


=== 2018-12-01 ===
=== 2023-05-03 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 12:41 wm-bot2: removed instance tools-k8s-etcd-13 - cookbook ran by andrew@bullseye
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts


=== 2018-11-30 ===
=== 2023-05-02 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 00:29 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|7199a9e}}) - cookbook ran by raymond@ubuntu
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])


=== 2018-11-27 ===
=== 2023-05-01 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
* 23:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:3b3803f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|3b3803f}}) - cookbook ran by raymond@ubuntu


=== 2018-11-26 ===
=== 2023-04-28 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 15:01 arturo: force reboot tools-k8s-worker-79, unresponsive
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 08:27 dcaro: rebooting tools-sgeweblight-10-28 ([[phab:T335336|T335336]])
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])
* 07:20 dcaro: rebooting tools-sgegrid-shadow due to stale nfs mount
* 00:09 bd808: `kubectl uncordon tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 00:07 bd808: Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon ([[phab:T335543|T335543]])
* 00:04 bd808: Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud ([[phab:T335543|T335543]])


=== 2018-11-20 ===
=== 2023-04-27 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 23:59 bd808: `kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67` ([[phab:T335543|T335543]])
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 20:50 bd808: Started process to rebuild all buster and bullseye based container images again. Prior problem seems to have been stale images in local cache on the build server.
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 20:42 bd808: Container image rebuild failed with GPG errors in buster-sssd base image. Will investigate and attempt to restart once resolved in a local dev environment.
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 20:33 bd808: Started process to rebuild all buster and bullseye based container images per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 09:43 godog: restart prometheus@tools on prometheus-01


=== 2018-11-16 ===
=== 2023-04-18 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 16:46 dcaro: force-rebooting tools-sgeweblight-10-25/26/27 as they got stuck stopping the grid_exec process
* 17:47 gtirloni: deleted tools-mail instance
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-27 due to stuck exec daemon not releasing port 6445
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 16:35 dcaro: rebooting root@tools-sgeweblight-10-25 due to stuck exec daemon not releasing port 6445
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 16:32 dcaro: rebooting root@tools-sgeweblight-10-26 due to stuck exec daemon not releasing port 6445
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
* 16:26 dcaro: rebooting root@tools-sgeexec-10-14 due to stuck exec daemon not releasing port 6445


=== 2018-11-14 ===
=== 2023-04-17 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 13:10 dcaro: rebooting tools-sgegrid-master node ([[phab:T334847|T334847]])
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 02:43 legoktm: manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009


=== 2018-11-13 ===
=== 2023-04-11 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 16:11 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|b65439b}}) - cookbook ran by arturo@nostromo
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 15:46 arturo: upload toolforge-jobs-framework-cli v11 to aptly
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 14:17 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller.git ({{Gerrit|d878e49}}) ([[phab:T324834|T324834]]) - cookbook ran by dcaro@vulcanus
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:19 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c6c693c from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c6c693c}}) - cookbook ran by arturo@nostromo
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:09 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:40bd3b3 from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|40bd3b3}}) - cookbook ran by dcaro@vulcanus
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 10:34 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|9aed7e5}}) - cookbook ran by taavi@runko
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
* 09:15 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ({{Gerrit|c6a3e29}}) ([[phab:T329677|T329677]]) - cookbook ran by taavi@runko
* 08:45 wm-bot2: Adding a new k8s worker node - cookbook ran by taavi@runko


=== 2018-11-08 ===
=== 2023-04-10 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 10:46 taavi: patch existing PSP roles to use policy/v1beta1 [[phab:T331619|T331619]]
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 09:16 arturo: upgrading k8s cluster to 1.22 ([[phab:T286856|T286856]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])


=== 2018-11-07 ===
=== 2023-04-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
* 14:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-3 ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 14:30 wm-bot2: removed instance tools-k8s-control-2 - cookbook ran by taavi@runko


=== 2018-11-02 ===
=== 2023-04-05 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 15:16 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|5ea5992}}) - cookbook ran by taavi@runko
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
* 15:10 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3569803 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3569803}}) - cookbook ran by taavi@runko
* 14:56 wm-bot2: Added a new k8s worker tools-k8s-worker-88.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:42 wm-bot2: Added a new k8s worker tools-k8s-worker-87.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:28 wm-bot2: Added a new k8s worker tools-k8s-worker-86.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:15 wm-bot2: Added a new k8s worker tools-k8s-worker-85.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 14:01 wm-bot2: Added a new k8s worker tools-k8s-worker-84.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:47 wm-bot2: Added a new k8s worker tools-k8s-worker-83.tools.eqiad1.wikimedia.cloud to the cluster ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:34 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:33 wm-bot2: removed instance tools-k8s-worker-83 - cookbook ran by taavi@runko
* 13:15 wm-bot2: Adding a new k8s worker node ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:06 wm-bot2: removing grid node tools-sgeweblight-10-31.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:02 wm-bot2: removing grid node tools-sgeweblight-10-29.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 13:00 wm-bot2: removing grid node tools-sgeexec-10-9.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:58 wm-bot2: removing grid node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:54 wm-bot2: removing grid node tools-sgeexec-10-7.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:52 wm-bot2: removing grid node tools-sgeweblight-10-13.tools.eqiad1.wikimedia.cloud ([[phab:T333972|T333972]]) - cookbook ran by taavi@runko
* 12:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-1 - cookbook ran by taavi@runko
* 12:07 wm-bot2: Added a new k8s control tools-k8s-control-6.tools.eqiad1.wikimedia.cloud to the cluster - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s control node - cookbook ran by taavi@runko
* 11:51 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:39 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:38 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:21 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 11:21 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 11:09 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:53 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:41 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:41 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
* 10:16 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko


=== 2018-10-31 ===
=== 2023-04-04 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 19:00 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
* 18:59 wm-bot2: removed instance tools-k8s-control-5 - cookbook ran by taavi@runko
* 18:46 wm-bot2: Adding a new k8s control node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 18:45 wm-bot2: Adding a new k8s CONTROL node ([[phab:T333929|T333929]]) - cookbook ran by taavi@runko
* 10:15 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:28 arturo: hard-reboot the 3 k8s control nodes


=== 2018-10-29 ===
=== 2023-04-03 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
* 17:13 wm-bot2: rebooted k8s node tools-k8s-worker-31 - cookbook ran by taavi@runko
* 17:11 wm-bot2: rebooted k8s node tools-k8s-worker-32 - cookbook ran by taavi@runko
* 17:09 wm-bot2: rebooted k8s node tools-k8s-worker-33 - cookbook ran by taavi@runko
* 17:07 wm-bot2: rebooted k8s node tools-k8s-worker-34 - cookbook ran by taavi@runko
* 17:05 wm-bot2: rebooted k8s node tools-k8s-worker-35 - cookbook ran by taavi@runko
* 17:04 wm-bot2: rebooted k8s node tools-k8s-worker-36 - cookbook ran by taavi@runko
* 17:02 wm-bot2: rebooted k8s node tools-k8s-worker-37 - cookbook ran by taavi@runko
* 17:00 wm-bot2: rebooted k8s node tools-k8s-worker-38 - cookbook ran by taavi@runko
* 16:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 - cookbook ran by taavi@runko
* 16:56 wm-bot2: rebooted k8s node tools-k8s-worker-40 - cookbook ran by taavi@runko
* 16:55 wm-bot2: rebooted k8s node tools-k8s-worker-41 - cookbook ran by taavi@runko
* 16:53 wm-bot2: rebooted k8s node tools-k8s-worker-42 - cookbook ran by taavi@runko
* 16:51 wm-bot2: rebooted k8s node tools-k8s-worker-43 - cookbook ran by taavi@runko
* 16:49 wm-bot2: rebooted k8s node tools-k8s-worker-44 - cookbook ran by taavi@runko
* 16:45 wm-bot2: rebooted k8s node tools-k8s-worker-45 - cookbook ran by taavi@runko
* 16:43 wm-bot2: rebooted k8s node tools-k8s-worker-46 - cookbook ran by taavi@runko
* 16:41 wm-bot2: rebooted k8s node tools-k8s-worker-47 - cookbook ran by taavi@runko
* 16:40 wm-bot2: rebooted k8s node tools-k8s-worker-48 - cookbook ran by taavi@runko
* 16:38 wm-bot2: rebooted k8s node tools-k8s-worker-49 - cookbook ran by taavi@runko
* 16:36 wm-bot2: rebooted k8s node tools-k8s-worker-50 - cookbook ran by taavi@runko
* 16:35 wm-bot2: rebooted k8s node tools-k8s-worker-51 - cookbook ran by taavi@runko
* 16:33 wm-bot2: rebooted k8s node tools-k8s-worker-52 - cookbook ran by taavi@runko
* 16:31 wm-bot2: rebooted k8s node tools-k8s-worker-53 - cookbook ran by taavi@runko
* 16:28 wm-bot2: rebooted k8s node tools-k8s-worker-54 - cookbook ran by taavi@runko
* 16:27 wm-bot2: rebooted k8s node tools-k8s-worker-55 - cookbook ran by taavi@runko
* 16:25 wm-bot2: rebooted k8s node tools-k8s-worker-56 - cookbook ran by taavi@runko
* 16:23 wm-bot2: rebooted k8s node tools-k8s-worker-57 - cookbook ran by taavi@runko
* 16:21 wm-bot2: rebooted k8s node tools-k8s-worker-58 - cookbook ran by taavi@runko
* 16:20 wm-bot2: rebooted k8s node tools-k8s-worker-59 - cookbook ran by taavi@runko
* 16:18 wm-bot2: rebooted k8s node tools-k8s-worker-60 - cookbook ran by taavi@runko
* 16:09 wm-bot2: rebooted k8s node tools-k8s-worker-61 - cookbook ran by taavi@runko
* 16:07 wm-bot2: rebooted k8s node tools-k8s-worker-62 - cookbook ran by taavi@runko
* 16:01 wm-bot2: rebooted k8s node tools-k8s-worker-64 - cookbook ran by taavi@runko
* 16:00 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:58 wm-bot2: rebooted k8s node tools-k8s-worker-65 - cookbook ran by taavi@runko
* 15:56 wm-bot2: rebooted k8s node tools-k8s-worker-66 - cookbook ran by taavi@runko
* 15:48 wm-bot2: rebooted k8s node tools-k8s-worker-67 - cookbook ran by taavi@runko
* 15:38 wm-bot2: rebooted k8s node tools-k8s-worker-68 - cookbook ran by taavi@runko
* 15:36 wm-bot2: rebooted k8s node tools-k8s-worker-69 - cookbook ran by taavi@runko
* 15:34 wm-bot2: rebooted k8s node tools-k8s-worker-70 - cookbook ran by taavi@runko
* 15:32 wm-bot2: rebooted k8s node tools-k8s-worker-71 - cookbook ran by taavi@runko
* 15:30 wm-bot2: rebooted k8s node tools-k8s-worker-72 - cookbook ran by taavi@runko
* 15:28 wm-bot2: rebooted k8s node tools-k8s-worker-73 - cookbook ran by taavi@runko
* 15:26 wm-bot2: rebooted k8s node tools-k8s-worker-74 - cookbook ran by taavi@runko
* 15:24 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:22 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 15:17 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
* 15:14 wm-bot2: rebooted k8s node tools-k8s-worker-76 - cookbook ran by taavi@runko
* 15:12 wm-bot2: rebooted k8s node tools-k8s-worker-77 - cookbook ran by taavi@runko
* 15:10 wm-bot2: rebooted k8s node tools-k8s-worker-78 - cookbook ran by taavi@runko
* 15:08 wm-bot2: rebooted k8s node tools-k8s-worker-79 - cookbook ran by taavi@runko
* 15:06 wm-bot2: rebooted k8s node tools-k8s-worker-80 - cookbook ran by taavi@runko
* 14:59 wm-bot2: rebooted k8s node tools-k8s-worker-81 - cookbook ran by taavi@runko
* 14:41 wm-bot2: rebooted k8s node tools-k8s-worker-82 - cookbook ran by taavi@runko
* 14:38 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
* 14:13 andrewbogott: test log to see if stashbot is back working
* 13:19 andrewbogott: forcing puppet run on all toolforge VMs
* 08:28 taavi: stop exim4.service on tools-sgecron-2 [[phab:T333477|T333477]]
* 06:52 taavi: stop jobs-framework-emailer to prevent spam due to NFS being read-only [[phab:T333477|T333477]]


=== 2018-10-26 ===
=== 2023-03-29 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 16:07 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|dc26f52}}) - cookbook ran by raymond@ubuntu
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 15:21 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:24115c7 from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|24115c7}}) - cookbook ran by raymond@ubuntu


=== 2018-10-19 ===
=== 2023-03-28 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 19:43 wm-bot2: deployed kubernetes component https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|e1b9815}}) - cookbook ran by raymond@ubuntu
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017


=== 2018-10-18 ===
=== 2023-03-27 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
* 22:51 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:70d550a from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|70d550a}}) - cookbook ran by raymond@ubuntu


=== 2018-10-16 ===
=== 2023-03-26 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
* 20:28 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko


=== 2018-10-07 ===
=== 2023-03-24 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 14:13 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens


=== 2018-09-21 ===
=== 2023-03-21 ===
* 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
* 08:11 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines


=== 2018-09-17 ===
=== 2023-03-20 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
* 13:39 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 10:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance


=== 2018-09-14 ===
=== 2023-03-19 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 09:32 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API


=== 2018-09-08 ===
=== 2023-03-17 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
* 15:56 andrewbogott: truncating .out, .err, and .log files to 10MB in anticipation of moving the NFS volumes


=== 2018-09-07 ===
=== 2023-03-13 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
* 09:50 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:f90bd8f from https://github.com/toolforge/buildpack-admission-controller ({{Gerrit|f90bd8f}}) - cookbook ran by dcaro@vulcanus


=== 2018-08-27 ===
=== 2023-03-12 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 13:40 taavi: restart haproxy on tools-k8s-haproxy-3
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]


=== 2018-08-22 ===
=== 2023-03-11 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 18:38 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
* 18:36 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:34 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:31 taavi: reboot misbehaving tools-sgeexec-10-11


=== 2018-08-19 ===
=== 2023-03-10 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
* 16:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|8b42b15}}) - cookbook ran by taavi@runko


=== 2018-08-14 ===
=== 2023-03-09 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 10:13 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|53e7f81}}) - cookbook ran by taavi@runko
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
* 10:04 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:834807c from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|834807c}}) - cookbook ran by taavi@runko


=== 2018-08-13 ===
=== 2023-03-08 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 22:31 bd808: Live hacked user-maintainer clusterrole to work around breakage in [[phab:T331572|T331572]]
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice


=== 2018-08-09 ===
=== 2023-03-07 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 11:34 wm-bot2: Increased quotas by 2 volumes - cookbook ran by fran@wmf3169
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 11:09 wm-bot2: Increased quotas by 6 snapshots - cookbook ran by fran@wmf3169
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
* 11:07 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169


=== 2018-08-08 ===
=== 2023-03-06 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
* 12:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|6688477}}) - cookbook ran by taavi@runko
* 12:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/registry-admission:e916fee from https://gerrit.wikimedia.org/r/labs/tools/registry-admission-webhook ({{Gerrit|e916fee}}) - cookbook ran by taavi@runko
* 12:16 arturo: delete calico deployment, redeploy from https://gitlab.wikimedia.org/repos/cloud/toolforge/calico ([[phab:T328539|T328539]])


=== 2018-08-06 ===
=== 2023-03-05 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
* 15:43 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|3e04025}}) - cookbook ran by taavi@runko


=== 2018-08-01 ===
=== 2023-03-02 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
* 11:32 arturo: aborrero@tools-k8s-control-2:~$ sudo -i kubectl apply -f /etc/kubernetes/toolforge-tool-roles.yaml (https://gerrit.wikimedia.org/r/c/operations/puppet/+/889836)


=== 2018-07-30 ===
=== 2023-03-01 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 13:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13eda9d}}) - cookbook ran by taavi@runko
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools


=== 2018-07-27 ===
=== 2023-02-28 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
* 17:19 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|9252af7}}) - cookbook ran by taavi@runko
* 17:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e46da83}}) - cookbook ran by taavi@runko


=== 2018-07-25 ===
=== 2023-02-23 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 18:07 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway ({{Gerrit|efb60b3}}) - cookbook ran by taavi@runko
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
* 09:33 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/buildpack-admission:b34e2f8 from https://github.com/toolforge/buildpack-admission-controller.git ({{Gerrit|b34e2f8}}) - cookbook ran by taavi@runko


=== 2018-07-18 ===
=== 2023-02-21 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 09:37 arturo: hard-reboot tools-sgeexec-10-11 (unresponsive to ssh)
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]


=== 2018-06-30 ===
=== 2023-02-20 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 11:24 taavi: redeploy volume-admission with helm and cert-manager certificates [[phab:T329530|T329530]] [[phab:T292238|T292238]]
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 11:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|ede8bd0}}) - cookbook ran by taavi@runko
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 11:05 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-volume-admission-controller:7fd13ac from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|7fd13ac}}) - cookbook ran by taavi@runko
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 10:39 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
* 09:20 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2018-06-29 ===
=== 2023-02-19 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 09:16 taavi: uncordon tools-k8s-worker-[80-82] after fixing security groups [[phab:T329378|T329378]]
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]


=== 2018-06-28 ===
=== 2023-02-17 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 11:32 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 11:31 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|7729b18}}) ([[phab:T254636|T254636]]) - cookbook ran by arturo@endurance
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 11:26 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|eeeea4c}}) - cookbook ran by arturo@endurance
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 11:24 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8a9b97e from https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-framework-api ({{Gerrit|618ab29}}) - cookbook ran by arturo@endurance
* 16:48 arturo: rebooting tools-docker-registry-01
* 10:25 arturo: build and push mariadb-sssd/base docker image for Toolforge ([[phab:T320178|T320178]], [[phab:T254636|T254636]])
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck


=== 2018-06-21 ===
=== 2023-02-16 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
* 15:58 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 15:30 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager ({{Gerrit|d71994e}}) - cookbook ran by arturo@nostromo
* 13:52 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|7191997}}) - cookbook ran by taavi@runko
* 13:44 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:1fe8ec4 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|1fe8ec4}}) - cookbook ran by taavi@runko
* 12:47 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/ingress-admission:e9b9920 from https://gerrit.wikimedia.org/r/cloud/toolforge/ingress-admission-controller ({{Gerrit|e9b9920}}) - cookbook ran by taavi@runko
* 10:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml
* 09:48 arturo: grid engine was failed over to shadow server, manually put it back into normal https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Grid#GridEngine_Master
* 09:39 arturo: aborrero@tools-sgegrid-shadow:~$ sudo truncate -s 1G /var/log/syslog (was 17G, full root disk)


=== 2018-06-20 ===
=== 2023-02-15 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
* 18:03 taavi: deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/889585/ to increase amount of haproxy max connections
* 15:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
* 09:50 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/cert-manager.git ({{Gerrit|e3f3ce1}}) ([[phab:T329453|T329453]]) - cookbook ran by taavi@runko
* 09:30 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2018-06-14 ===
=== 2023-02-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
* 15:07 taavi: import cert-manager components to local docker registry [[phab:T329453|T329453]]
* 12:12 arturo: the fixed webservicemonitor is starting a bunch of grid webservices ([[phab:T329611|T329611]])
* 12:10 arturo: included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! ([[phab:T329611|T329611]], [[phab:T329467|T329467]], [[phab:T244809|T244809]])


=== 2018-06-11 ===
=== 2023-02-13 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
* 16:05 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169
* 16:03 taavi: update maintain-kubeusers deployment to use helm
* 15:05 taavi: deploy jobs-api updates, improving some status messages
* 15:04 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|13d87c4}}) - cookbook ran by taavi@runko
* 15:00 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:390ed64 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|390ed64}}) - cookbook ran by taavi@runko
* 13:14 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/maintain-kubeusers:aac195b from https://gerrit.wikimedia.org/r/labs/tools/maintain-kubeusers ({{Gerrit|aac195b}}) - cookbook ran by taavi@runko


=== 2018-06-08 ===
=== 2023-02-10 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting `prometheus-node-exporter` again and force rotating the exim4 paniclog on 12 nodes
* 15:45 taavi: reboot tools-k8s-worker-82 to troubleshoot network issues
* 12:44 wm-bot2: Added a new k8s worker tools-k8s-worker-82.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:31 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:29 wm-bot2: Added a new k8s worker tools-k8s-worker-81.tools.eqiad1.wikimedia.cloud to the worker pool ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 12:15 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:53 wm-bot2: Adding a new k8s worker node ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:44 wm-bot2: removing grid node tools-sgeweblight-10-23.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:42 wm-bot2: removing grid node tools-sgeexec-10-5.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:39 wm-bot2: removing grid node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:26 wm-bot2: removing grid node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko
* 11:24 wm-bot2: removing grid node tools-sgeexec-10-1.tools.eqiad1.wikimedia.cloud ([[phab:T329357|T329357]]) - cookbook ran by taavi@runko


=== 2018-06-07 ===
=== 2023-02-01 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
* 16:03 taavi: deployed tools-webservice 0.89
* 15:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|372037f}}) - cookbook ran by taavi@runko


=== 2018-06-06 ===
=== 2023-01-26 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]])
* 15:05 taavi: drain and reboot tools-k8s-worker-74 which seems to have some issues with nfs
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 14:37 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|307f302}}) - cookbook ran by taavi@runko
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
* 14:30 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:05966c6 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|05966c6}}) - cookbook ran by taavi@runko
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid


=== 2018-06-05 ===
=== 2023-01-24 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben ([[phab:T196486|T196486]])
* 12:04 taavi: deploying toolforge-jobs-framework-cli v10 [[phab:T327775|T327775]]
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 10:07 taavi: publish toolforge-jobs-framework-cli v9
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])


=== 2018-06-04 ===
=== 2023-01-23 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
* 11:31 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d5ae229}}) - cookbook ran by taavi@runko
* 11:23 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:d085c50 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d085c50}}) - cookbook ran by taavi@runko
* 11:17 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config ({{Gerrit|864171a}}) - cookbook ran by taavi@runko


=== 2018-06-03 ===
=== 2023-01-20 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and tools.mbh that has a job name starting 'comm_delin', 'delfilexcl' [[phab:T195834|T195834]]
* 23:24 andrewbogott: truncating logfiles with find . -name '*.err'  -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 21:24 andrewbogott: truncating logfiles with find . -name '*.out' -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 01:06 andrewbogott: truncating logfiles with find . -name '*.log' -size +1G -exec truncate --size=100M <nowiki>{</nowiki><nowiki>}</nowiki> \;
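The three truncation entries above repeat one `find ... -exec truncate` pattern with a different file extension each time. A minimal sketch of that pattern as a reusable function follows. It assumes GNU `find` and GNU coreutils `truncate`; the function name `truncate_large_files` and its parameters are hypothetical and do not appear in the log.

```shell
#!/bin/sh
# Hypothetical helper (not part of the log): cap oversized files in place.
# Usage: truncate_large_files DIR PATTERN MIN_SIZE NEW_SIZE
truncate_large_files() {
    dir=$1
    pattern=$2
    min_size=$3
    new_size=$4
    # "-size +N" matches files larger than N; find rounds file sizes up to
    # whole units of the given suffix before comparing.
    # truncate keeps the first NEW_SIZE bytes of each match and drops the rest.
    find "$dir" -type f -name "$pattern" -size "+$min_size" \
        -exec truncate --size="$new_size" {} \;
}
```

Under these assumptions, `truncate_large_files . '*.err' 1G 100M` corresponds to the 23:24 entry. Because `truncate` keeps the start of each file, the most recent log lines are the ones discarded.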


=== 2018-05-31 ===
=== 2023-01-19 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:46 arturo: `aborrero@tools-k8s-control-1:~$ sudo -i kubectl delete clusterrolebinding jobs-api-psp` (cleanup unused stuff)
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101


=== 2018-05-30 ===
=== 2023-01-18 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 15:42 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0ad4c66}}) - cookbook ran by arturo@nostromo
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 15:29 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:54cc15e from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|54cc15e}}) - cookbook ran by arturo@nostromo
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]


=== 2018-05-28 ===
=== 2023-01-17 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 13:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8cf38a1}}) - cookbook ran by arturo@endurance
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
* 13:51 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0d0a882}}) - cookbook ran by arturo@endurance
* 13:34 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3a58c1d from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|3a58c1d}}) - cookbook ran by arturo@endurance


=== 2018-05-25 ===
=== 2023-01-10 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
* 11:55 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9514b00 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8e0a2f9}}) - cookbook ran by arturo@endurance
* 11:36 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0243967}}) - cookbook ran by arturo@endurance


=== 2018-05-22 ===
=== 2023-01-03 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
* 17:17 andrewbogott: find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;


=== 2018-05-18 ===
=== 2022-12-20 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
* 09:07 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2018-05-16 ===
=== 2022-12-12 ===
* 21:17 zhuyifei1999_: maintain-kubeusers stuck in infinite sleeps of 10 seconds
* 14:36 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2018-05-15 ===
=== 2022-12-09 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
* 07:20 taavi: change the canonical tools-mail external hostname to use mail.tools.wmcloud.org and add valid spf to toolforge.org [[phab:T324809|T324809]]
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding


=== 2018-05-12 ===
=== 2022-12-05 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
* 11:06 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2018-05-11 ===
=== 2022-11-30 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 10:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|bc3529d}}) - cookbook ran by arturo@nostromo
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 10:17 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:c360d54 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c360d54}}) - cookbook ran by arturo@nostromo


=== 2018-05-10 ===
=== 2022-11-29 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
* 19:52 taavi: clear puppet failure emails from exim queues


=== 2018-05-09 ===
=== 2022-11-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
* 08:58 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo


=== 2018-05-07 ===
=== 2022-11-05 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 19:28 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 13:26 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log'  -size +1G -exec truncate --size=1G <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours


=== 2018-05-05 ===
=== 2022-11-04 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
* 20:41 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 14:02 andrewbogott: cleaning up nfs share with  root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -not -newermt "Nov 1, 2021" -exec rm <nowiki>{</nowiki><nowiki>}</nowiki> \;
* 12:20 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d464be4}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo
* 12:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:2b800f5 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|2b800f5}}) ([[phab:T304900|T304900]]) - cookbook ran by arturo@nostromo


=== 2018-05-03 ===
=== 2022-11-01 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
* 09:37 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T322110|T322110]]) - cookbook ran by dcaro@vulcanus


=== 2018-05-01 ===
=== 2022-10-26 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
* 08:45 dcaro: depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again


=== 2018-04-27 ===
=== 2022-10-25 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 16:14 wm-bot2: Increased quotas by 5120 gigabytes - cookbook ran by fran@wmf3169
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
* 15:26 dcaro: pushed a newer docker-registry.tools.wmflabs.org/python:3.9-slim-bullseye (from upstream python:3.9-slim-bullseye)


=== 2018-04-23 ===
=== 2022-10-20 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
* 16:54 andrewbogott: rebooting tools-package-builder-04
* 16:49 andrewbogott: rebooting redis nodes (one at a time)
* 10:54 taavi: rebuild mono68-sssd image with the expired DST Root CA X3 removed [[phab:T311466|T311466]]


=== 2018-04-22 ===
=== 2022-10-18 ===
* 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`
* 11:52 taavi: deploy toolforge-jobs-framework-cli deb v8
* 10:30 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo
* 10:27 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:9be2272 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|9be2272}}) - cookbook ran by taavi@runko
* 10:18 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|64385e9}}) ([[phab:T320405|T320405]]) - cookbook ran by arturo@nostromo


=== 2018-04-15 ===
=== 2022-10-17 ===
* 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 07:25 taavi: push updated perl532 images [[phab:T320824|T320824]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]


=== 2018-04-11 ===
=== 2022-10-14 ===
* 13:25 chasemp: cleanup exim frozen messages in an effort to alleviate queue pressure
* 07:54 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|0cc020e}}) ([[phab:T311466|T311466]]) - cookbook ran by taavi@runko


=== 2018-04-06 ===
=== 2022-10-13 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 15:10 arturo: restart jobs-emailer pod
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for [[phab:T159254|T159254]]


=== 2018-04-05 ===
=== 2022-10-12 ===
* 18:46 chicocvenancio: killed wget that was hogging io
* 23:25 bd808: Rebuilding all Toolforge docker images ([[phab:T278436|T278436]], [[phab:T311466|T311466]], [[phab:T293552|T293552]])
* 20:43 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. Third try seems to be working. ([[phab:T316554|T316554]])
* 20:31 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages after fixing bug in building the bullseye base image. ([[phab:T316554|T316554]])
* 16:26 dcaro: deploy the latest registry admission webhook, now for real (image tag {{Gerrit|07bc7db}})
* 12:48 dcaro: deploy the latest registry admission webhook (image tag {{Gerrit|07bc7db}})
* 09:26 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 09:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2018-03-29 ===
=== 2022-10-11 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 13:52 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:8574c36 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8574c36}}) - cookbook ran by taavi@runko
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done


=== 2018-03-28 ===
=== 2022-10-10 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
* 19:30 taavi: rebooting all k8s worker nodes to clean up labstore1006/7 remains
* 16:51 taavi: clean up labstore1006/7 mounts from k8s control nodes [[phab:T320425|T320425]]
* 11:35 arturo: aborrero@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer rollout restart deployment/jobs-emailer ([[phab:T317998|T317998]])
* 08:44 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) ([[phab:T320284|T320284]]) - cookbook ran by taavi@runko
* 08:39 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|afa90ed}}) - cookbook ran by taavi@runko


=== 2018-03-26 ===
=== 2022-10-09 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 17:29 taavi: kill 10 idle tmux sessions of user 'hoi' on tools-sgebastion-10 [[phab:T320352|T320352]]


=== 2018-03-23 ===
=== 2022-10-07 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 13:02 taavi: taavi@cloudcontrol1005 ~ $ sudo mark_tool --disable oncall # [[phab:T320240|T320240]]
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/


=== 2018-03-22 ===
=== 2022-10-06 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 00:39 bd808: Image rebuild failing with debian apt repo signature issue. Will investigate tomorrow. ([[phab:T316554|T316554]])
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 00:36 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. ([[phab:T316554|T316554]])
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 00:04 bd808: Building new php74-sssd-base & web images ([[phab:T310435|T310435]])
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'


=== 2018-03-21 ===
=== 2022-10-03 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 14:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller ({{Gerrit|8da432b}}) - cookbook ran by taavi@runko
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])


=== 2018-03-20 ===
=== 2022-09-28 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)


=== 2018-03-19 ===
=== 2022-09-22 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]


=== 2018-03-16 ===
=== 2022-09-10 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp


=== 2018-03-15 ===
=== 2022-09-07 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])


=== 2018-03-14 ===
=== 2022-09-06 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full


=== 2018-03-12 ===
=== 2022-08-25 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages from jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers


=== 2018-03-08 ===
=== 2022-08-24 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
* 12:20 taavi: upgrading ingress-nginx to v1.3


=== 2018-03-07 ===
=== 2022-08-20 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/


=== 2018-03-06 ===
=== 2022-08-18 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 14:45 andrewbogott: adding lucaswerkmeister  as projectadmin ([[phab:T314527|T314527]])
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this in canary servers last week and it went fine, so now running it fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster


=== 2018-03-05 ===
=== 2022-08-17 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]


=== 2018-03-02 ===
=== 2022-08-16 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05


=== 2018-03-01 ===
=== 2022-08-11 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues


=== 2018-02-27 ===
=== 2022-08-05 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 12:16 arturo: running `apt-get autoremove` in canary servers
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2018-02-26 ===
=== 2022-08-03 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 10:35 arturo: enable puppet in tools-proxy-01
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests


=== 2018-02-25 ===
=== 2022-07-20 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2018-02-23 ===
=== 2022-07-19 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest


=== 2018-02-22 ===
=== 2022-07-17 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2018-02-21 ===
=== 2022-07-14 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 13:48 taavi: rebooting tools-sgeexec-10-2
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...


=== 2018-02-20 ===
=== 2022-07-13 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 12:42 arturo: upgrading tools-k8s-etcd-01


=== 2018-02-19 ===
=== 2022-07-11 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration


=== 2018-02-16 ===
=== 2022-07-07 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y


=== 2018-02-15 ===
=== 2022-06-28 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia


=== 2018-02-14 ===
=== 2022-06-27 ===
* 13:09 arturo: the reboot was OK, the server seems to be working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]


=== 2018-02-11 ===
=== 2022-06-23 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]
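The two `find` runs logged at 01:21 and 01:28 above remove the world-writable bit from top-level directories. The following is a minimal sketch of the same idea that can be tried in a scratch directory. The function name `strip_world_write` and the demo paths are hypothetical, and the `\! -uid 0` filter from the logged command is left out so the demo behaves the same for any user.

```shell
# Hypothetical helper: remove the "other" write bit from every entry
# directly inside the given directory (no recursion).
strip_world_write() {
    # -mindepth 1 skips the directory itself; -perm -o+w matches
    # entries that are writable by "other".
    find "$1" -mindepth 1 -maxdepth 1 -perm -o+w -exec chmod -v o-w {} \;
}

# Demo in a throwaway directory (paths are made up for illustration).
demo="$(mktemp -d)"
mkdir "$demo/open" "$demo/closed"
chmod 0777 "$demo/open"
chmod 0755 "$demo/closed"
strip_world_write "$demo"
stat -c '%a %n' "$demo"/*
rm -rf "$demo"
```

Only `open` is touched, because `closed` never had the other-write bit set. Because `chmod o-w` changes only that bit, setgid modes such as the logged 2777 become 2775 rather than losing the setgid flag.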


=== 2018-02-09 ===
=== 2022-06-22 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there


=== 2018-02-08 ===
=== 2022-06-21 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.


=== 2018-02-06 ===
=== 2022-06-03 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 13:05 arturo: unpublish/publish trusty-tools repo
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to trusty-tools repo (self contained)
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]


=== 2018-02-05 ===
=== 2022-06-02 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2018-02-03 ===
=== 2022-06-01 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]


=== 2018-01-31 ===
=== 2022-05-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation


=== 2018-01-29 ===
=== 2022-05-30 ===
* 20:02 chasemp: add zhuyifei1999_ tools root for  [[phab:T185577|T185577]]
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent


=== 2018-01-28 ===
=== 2022-05-26 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03


=== 2018-01-27 ===
=== 2022-05-22 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2018-01-25 ===
=== 2022-05-16 ===
* 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]


=== 2018-01-23 ===
=== 2022-05-14 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: add me, arturo, chico to sudoers and removed marc


=== 2018-01-22 ===
=== 2022-05-12 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster


=== 2018-01-19 ===
=== 2022-05-10 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'


=== 2018-01-18 ===
=== 2022-05-06 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'


=== 2018-01-17 ===
=== 2022-05-05 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
* 15:04 andrewbogott: depooling exec-manage tools-exec-1430.  Experimenting with purge-old-kernels
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'


=== 2018-01-16 ===
=== 2022-05-03 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 after host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail  almouked@ltnet.net 719 pending messages cleared
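The 22:01 entry above pipes `qstat -explain E -xml` through `grep` and `sed` to pull out queue names and hands them to `qmod -cq` (in the wiki source each pipe is written as `{{!}}`). The text-processing step can be sketched as below. The function name `queue_names_from_xml` and the sample XML are hypothetical, and the sample is only loosely shaped like grid engine output. The `qmod` call is echoed rather than executed.

```shell
# Hypothetical helper: print the text of every <name>...</name> element
# found on stdin, one per line, with surrounding whitespace removed.
queue_names_from_xml() {
    sed -n 's:.*<name>\(.*\)</name>.*:\1:p'
}

# Made-up sample input; real qstat XML carries many more fields.
sample='<Queue-List>
  <name>continuous@tools-exec-1401.example</name>
  <state>E</state>
</Queue-List>
<Queue-List>
  <name>task@tools-exec-1402.example</name>
  <state>E</state>
</Queue-List>'

# Dry run: print one qmod command per queue instead of running it.
printf '%s\n' "$sample" | queue_names_from_xml | xargs -n 1 echo qmod -cq
```

Doing the extraction in a single `sed -n ... p` keeps only lines that actually contain a `<name>` element, whereas `grep 'name'` would also pass any other line that merely mentions the word.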


=== 2018-01-11 ===
=== 2022-05-02 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install meltdown kernel and reboot workers 1011-1016 as jessie pilot


=== 2018-01-10 ===
=== 2022-04-25 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:46 bd808: Building toolforge-webservice v0.82
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
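The 13:22 entry above empties oversized log files by hand. The log does not record how this was done on tools-worker-1020. One common approach, sketched here as an assumption, is to truncate the file in place so that it keeps its inode and the daemon holding it open can keep writing. The function name `empty_log` is hypothetical and the demo uses a scratch file, not a real log.

```shell
# Hypothetical helper: empty a file in place. The inode is unchanged,
# so a process that already has the file open is not disturbed.
empty_log() {
    truncate -s 0 -- "$1"
}

# Demo on a scratch file.
log="$(mktemp)"
printf 'line 1\nline 2\n' > "$log"
inode_before="$(stat -c %i "$log")"
empty_log "$log"
inode_after="$(stat -c %i "$log")"
echo "size: $(stat -c %s "$log") bytes"
[ "$inode_before" = "$inode_after" ] && echo "inode unchanged"
rm -f "$log"
```

Deleting and recreating the file would instead leave the running daemon writing to the old, unlinked inode, and the disk space would not be released until that process closed it.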


=== 2018-01-09 ===
=== 2022-04-23 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
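The 20:55 and 21:10 entries above loop over `kubectl get nodes` output and cordon or uncordon every node except a few named ones. The node-selection step can be sketched as follows. The function name `nodes_except` and the sample table are hypothetical. Unlike the logged `grep -v -e ...` substring filter, this sketch matches the NAME column exactly and skips the header row, and the `kubectl` command is echoed instead of run.

```shell
# Hypothetical helper: read a `kubectl get nodes`-style table on stdin
# and print every node name that is not listed as an argument.
nodes_except() {
    awk -v skip="$*" '
        BEGIN { n = split(skip, s, " "); for (i = 1; i <= n; i++) drop[s[i]] = 1 }
        # NR > 1 skips the header row; $1 is the NAME column.
        NR > 1 && !($1 in drop) { print $1 }
    '
}

# Made-up sample table for illustration.
sample='NAME                STATUS   AGE
tools-worker-1001   Ready    1d
tools-worker-1002   Ready    1d
tools-worker-1016   Ready    1d'

# Dry run: show the cordon command for each remaining node.
for n in $(printf '%s\n' "$sample" | nodes_except tools-worker-1001 tools-worker-1016); do
    echo kubectl cordon "$n"
done
```

With exact matching, an exclusion must be spelled the way the node appears in the NAME column, so fully qualified node names need fully qualified exclusions.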


=== 2018-01-08 ===
=== 2022-04-20 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2018-01-06 ===
=== 2022-04-16 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)


=== 2018-01-05 ===
=== 2022-04-12 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)


=== 2018-01-04 ===
=== 2022-04-10 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)


=== 2018-01-03 ===
=== 2022-04-09 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush
 
=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component
 
=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7
 
=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions
 
=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo
 
=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)
 
=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo ([[phab:T297090|T297090]])
 
=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902
 
=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand
 
=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
 
=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access
 
=== 2022-02-16 ===
* 00:12 bd808: Image builds completed.
 
=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]
 
=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]
 
=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]
 
=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]
 
=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes
 
=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate
 
=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]
 
=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])
 
=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4
 
=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])
 
=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])
 
=== 2022-01-19 ===
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move
 
=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]
 
=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'
 
=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)
</noinclude>
</noinclude>
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>

Latest revision as of 20:21, 8 June 2023

2023-06-08

  • 20:21 bd808: Rebuilding container images (T337897)
  • 14:16 dcaro: restart tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud due to nfs hiccup
  • 14:07 dcaro: restarting the tools-sgeexec-10-17 node due to nfs hiccup
  • 14:00 dcaro: restarting the tools-sgegrid-master node due to nfs hiccup
  • 12:00 dcaro: powering off tools-k8s-etcd-18 (T334644)
  • 07:18 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (24e7828) - cookbook ran by taavi@runko

2023-06-07

2023-06-05

  • 07:53 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2023-06-01

2023-05-31

  • 02:38 andrewbogott: rebooted tools-sgeweblight-10-16, T337806

2023-05-30

  • 00:22 andrewbogott: rebooted tools-sgeweblight-10-30, oom
  • 00:16 andrewbogott: rebooted tools-sgeweblight-10-24, seems to be oom

2023-05-26

2023-05-24

  • 12:28 dcaro: deploy latest buildservice (T335865)
  • 12:28 dcaro: deploy latest buildservice (T336050)

2023-05-23

2023-05-22

  • 10:06 arturo: hard-reboot tools-sgeexec-10-18 (monitoring reporting it as down)

2023-05-19

  • 13:38 arturo: uncordon tools-k8s-worker-47/48/64/75
  • 08:46 bd808: Building new perl532-sssd/{base,web} images (T323522, T320904)

2023-05-17

  • 16:05 dcaro: release toolforge-cli 0.3.0 (T336225)
  • 12:48 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway (fa8ed2c) (T336225) - cookbook ran by dcaro@vulcanus
  • 12:48 wm-bot2: rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus
  • 12:45 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway (d1bb238) (T336225) - cookbook ran by dcaro@vulcanus
  • 12:43 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api (8d21314) - cookbook ran by dcaro@vulcanus
  • 10:54 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-buildpack-admission-controller:7199a9e from https://github.com/toolforge/buildpack-admission-controller (7199a9e) - cookbook ran by fran@wmf3169
  • 08:49 wm-bot2: rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:33 wm-bot2: rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:32 wm-bot2: rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:25 wm-bot2: rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:17 wm-bot2: rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:10 wm-bot2: rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus
  • 08:03 wm-bot2: rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by dcaro@vulcanus
  • 07:54 wm-bot2: rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus
  • 07:46 wm-bot2: rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by dcaro@vulcanus
  • 07:45 wm-bot2: rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by dcaro@vulcanus
  • 07:42 wm-bot2: rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus
  • 07:29 wm-bot2: rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus

2023-05-16

2023-05-15

  • 22:50 bd808: Rebuilding bullseye and buster docker containers to pick up make package addition (T320343)
  • 22:09 wm-bot2: rebooted k8s node tools-k8s-worker-66 (T316544) - cookbook ran by andrew@bullseye
  • 22:07 wm-bot2: rebooted k8s node tools-k8s-worker-65 (T316544) - cookbook ran by andrew@bullseye
  • 22:06 wm-bot2: rebooted k8s node tools-k8s-worker-64 (T316544) - cookbook ran by andrew@bullseye
  • 22:04 wm-bot2: rebooted k8s node tools-k8s-worker-62 (T316544) - cookbook ran by andrew@bullseye
  • 22:02 wm-bot2: rebooted k8s node tools-k8s-worker-61 (T316544) - cookbook ran by andrew@bullseye
  • 21:58 wm-bot2: rebooted k8s node tools-k8s-worker-60 (T316544) - cookbook ran by andrew@bullseye
  • 21:56 wm-bot2: rebooted k8s node tools-k8s-worker-59 (T316544) - cookbook ran by andrew@bullseye
  • 21:54 wm-bot2: rebooted k8s node tools-k8s-worker-58 (T316544) - cookbook ran by andrew@bullseye
  • 21:52 wm-bot2: rebooted k8s node tools-k8s-worker-57 (T316544) - cookbook ran by andrew@bullseye
  • 21:51 wm-bot2: rebooted k8s node tools-k8s-worker-56 (T316544) - cookbook ran by andrew@bullseye
  • 21:50 wm-bot2: rebooted k8s node tools-k8s-worker-55 (T316544) - cookbook ran by andrew@bullseye
  • 21:49 wm-bot2: rebooted k8s node tools-k8s-worker-54 (T316544) - cookbook ran by andrew@bullseye
  • 21:47 wm-bot2: rebooted k8s node tools-k8s-worker-53 (T316544) - cookbook ran by andrew@bullseye
  • 21:44 wm-bot2: rebooted k8s node tools-k8s-worker-52 (T316544) - cookbook ran by andrew@bullseye
  • 21:42 wm-bot2: rebooted k8s node tools-k8s-worker-51 (T316544) - cookbook ran by andrew@bullseye
  • 21:41 wm-bot2: rebooted k8s node tools-k8s-worker-50 (T316544) - cookbook ran by andrew@bullseye
  • 21:40 wm-bot2: rebooted k8s node tools-k8s-worker-49 (T316544) - cookbook ran by andrew@bullseye
  • 21:38 wm-bot2: rebooted k8s node tools-k8s-worker-48 (T316544) - cookbook ran by andrew@bullseye
  • 21:37 wm-bot2: rebooted k8s node tools-k8s-worker-47 (T316544) - cookbook ran by andrew@bullseye
  • 21:33 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by andrew@bullseye
  • 21:16 wm-bot2: rebooted k8s node tools-k8s-worker-45 (T316544) - cookbook ran by dcaro@vulcanus
  • 21:15 wm-bot2: rebooted k8s node tools-k8s-worker-44 (T316544) - cookbook ran by dcaro@vulcanus
  • 21:13 wm-bot2: rebooted k8s node tools-k8s-worker-43 (T316544) - cookbook ran by dcaro@vulcanus
  • 21:12 wm-bot2: rebooted k8s node tools-k8s-worker-42 (T316544) - cookbook ran by dcaro@vulcanus
  • 21:09 wm-bot2: rebooted k8s node tools-k8s-worker-41 (T316544) - cookbook ran by dcaro@vulcanus
  • 21:03 wm-bot2: rebooted k8s node tools-k8s-worker-40 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:52 wm-bot2: rebooted k8s node tools-k8s-worker-38 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:50 wm-bot2: rebooted k8s node tools-k8s-worker-37 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:49 wm-bot2: rebooted k8s node tools-k8s-worker-36 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:48 wm-bot2: rebooted k8s node tools-k8s-worker-35 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:47 wm-bot2: rebooted k8s node tools-k8s-worker-34 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:42 wm-bot2: rebooted k8s node tools-k8s-worker-33 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:41 andrewbogott: rebooting frozen VMs: tools-k8s-worker-65, tools-sgeweblight-10-27, tools-k8s-worker-45, tools-k8s-worker-36, tools-sgewebgen-10-3 (fallout from earlier nfs outage)
  • 20:36 wm-bot2: rebooted k8s node tools-k8s-worker-32 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:32 wm-bot2: rebooted k8s node tools-k8s-worker-31 (T316544) - cookbook ran by dcaro@vulcanus
  • 20:24 wm-bot2: rebooted k8s node tools-k8s-worker-30 (T316544) - cookbook ran by dcaro@vulcanus
  • 19:04 wm-bot2: rebooted k8s node tools-k8s-worker-67 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:56 wm-bot2: rebooted k8s node tools-k8s-worker-68 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:49 wm-bot2: rebooted k8s node tools-k8s-worker-69 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:46 bd808: Hard reboot tools-static-14 via Horizon per IRC report of unresponsive requests
  • 18:44 wm-bot2: rebooted k8s node tools-k8s-worker-70 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:42 wm-bot2: rebooted k8s node tools-k8s-worker-71 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:39 wm-bot2: rebooted k8s node tools-k8s-worker-72 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:34 wm-bot2: rebooted k8s node tools-k8s-worker-73 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:28 wm-bot2: rebooted k8s node tools-k8s-worker-74 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:22 wm-bot2: rebooted k8s node tools-k8s-worker-75 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:22 taavi: clear mail queue
  • 18:21 wm-bot2: rebooted k8s node tools-k8s-worker-76 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:15 wm-bot2: rebooted k8s node tools-k8s-worker-77 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:08 wm-bot2: rebooted k8s node tools-k8s-worker-80 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:06 wm-bot2: rebooted k8s node tools-k8s-worker-81 (T316544) - cookbook ran by dcaro@vulcanus
  • 18:05 wm-bot2: rebooted k8s node tools-k8s-worker-82 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:57 wm-bot2: rebooted k8s node tools-k8s-worker-83 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:48 wm-bot2: rebooted k8s node tools-k8s-worker-84 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:47 wm-bot2: rebooted k8s node tools-k8s-worker-85 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:38 wm-bot2: rebooted k8s node tools-k8s-worker-86 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:37 wm-bot2: rebooted k8s node tools-k8s-worker-87 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:35 wm-bot2: rebooted k8s node tools-k8s-worker-88 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:34 wm-bot2: rebooting all the workers of tools k8s cluster (64 nodes) (T316544) - cookbook ran by dcaro@vulcanus
  • 17:20 wm-bot2: rebooted k8s node tools-k8s-worker-87 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:19 wm-bot2: rebooted k8s node tools-k8s-worker-88 (T316544) - cookbook ran by dcaro@vulcanus
  • 17:17 bd808: Rebuilding bullseye and buster docker containers to pick up openssh-client package addition (T258841)
  • 17:12 wm-bot2: rebooting the whole tools k8s cluster (64 nodes) (T316544) - cookbook ran by dcaro@vulcanus
  • 17:06 dcaro: rebooting tools-sgegrid-shadow (T316544)
  • 17:00 dcaro: rebooting tools-sgegrid-master (T316544)
  • 16:55 dcaro: rebooting tools-sgeexec-10-20 (T316544)
  • 16:53 dcaro: rebooting tools-sgeweblight-10-18 (T316544)
  • 16:53 dcaro: rebooting tools-sgeweblight-10-25 (T316544)
  • 16:53 dcaro: rebooting tools-sgeweblight-10-20 (T316544)
  • 16:52 dcaro: rebooting tools-sgeweblight-10-21 (T316544)
  • 16:52 dcaro: rebooting tools-sgeexec-10-22 (T316544)
  • 16:51 dcaro: rebooting tools-sgeweblight-10-28 (T316544)
  • 16:50 dcaro: rebooting tools-sgeexec-10-17 (T316544)
  • 16:48 dcaro: rebooting tools-sgeexec-10-21 (T316544)
  • 16:47 dcaro: rebooting tools-sgeexec-10-19 (T316544)
  • 16:45 dcaro: rebooting tools-sgeexec-10-8 (T316544)
  • 16:45 dcaro: rebooting tools-sgeweblight-10-24 (T316544)
  • 16:44 dcaro: rebooting tools-sgewebgen-10-2 (T316544)
  • 16:44 dcaro: rebooting tools-sgeweblight-10-16 (T316544)
  • 16:43 dcaro: rebooting tools-sgeweblight-10-30 (T316544)
  • 16:43 dcaro: rebooting tools-sgeexec-10-18 (T316544)
  • 16:42 dcaro: rebooting tools-sgeexec-10-16 (T316544)
  • 16:42 dcaro: rebooting tools-sgeexec-10-14 (T316544)
  • 16:41 dcaro: rebooting tools-sgeweblight-10-32 (T316544)
  • 16:40 dcaro: rebooting tools-sgeweblight-10-22 (T316544)
  • 16:39 dcaro: rebooting tools-sgeweblight-10-17 (T316544)
  • 16:32 dcaro: rebooting tools-sgeexec-10-13.tools.eqiad1.wikimedia.cloud (T316544)
  • 16:23 dcaro: rebooting tools-sgeweblight-10-26 (T316544)
  • 16:15 bd808: Hard reboot of tools-sgebastion-11 via Horizon (done circa 16:11Z)
  • 16:14 arturo: rebooted a bunch of nodes to clean up D-state procs and high load avg caused by the NFS outage (result of T316544)
  • 12:36 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/builds-api:09f3b49-dev from https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-builds-api.git (32a8ae9) - cookbook ran by dcaro@vulcanus
  • 09:12 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/volume-admission:c64da5a from https://gerrit.wikimedia.org/r/cloud/toolforge/volume-admission-controller (c64da5a) - cookbook ran by dcaro@vulcanus

2023-05-13

  • 09:13 taavi: reboot tools-sgeexec-10-15,17,18,21

2023-05-11

  • 15:48 bd808: Rebooted tools-sgebastion-10 for T336510
  • 15:31 bd808: Sent `wall` for reboot of tools-sgebastion-10 circa 15:40Z

2023-05-09

2023-05-08

  • 09:12 arturo: force-reboot tools-sgeexec-10-13 (reported as down by the monitoring, no SSH)

2023-05-07

  • 16:06 taavi: remove inbound 25/tcp rule from the toolserver legacy server T136225

2023-05-05

2023-05-04

  • 15:15 wm-bot2: removed instance tools-k8s-etcd-15 - cookbook ran by andrew@bullseye
  • 14:13 wm-bot2: removed instance tools-k8s-etcd-14 - cookbook ran by andrew@bullseye

2023-05-03

  • 12:41 wm-bot2: removed instance tools-k8s-etcd-13 - cookbook ran by andrew@bullseye

2023-05-02

2023-05-01

2023-04-28

  • 15:01 arturo: force reboot tools-k8s-worker-79, unresponsive
  • 08:27 dcaro: rebooting tools-sgeweblight-10-28 (T335336)
  • 07:20 dcaro: rebooting tools-sgegrid-shadow due to stale nfs mount
  • 00:09 bd808: `kubectl uncordon tools-k8s-worker-67` (T335543)
  • 00:07 bd808: Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon (T335543)
  • 00:04 bd808: Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud (T335543)

2023-04-27

  • 23:59 bd808: `kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67` (T335543)
  • 20:50 bd808: Started process to rebuild all buster and bullseye based container images again. Prior problem seems to have been stale images in local cache on the build server.
  • 20:42 bd808: Container image rebuild failed with GPG errors in buster-sssd base image. Will investigate and attempt to restart once resolved in a local dev environment.
  • 20:33 bd808: Started process to rebuild all buster and bullseye based container images per https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images

2023-04-18

  • 16:46 dcaro: force-rebooting tools-sgeweblight-10-25/26/27 as they got stuck stopping the grid_exec process
  • 16:35 dcaro: rebooting root@tools-sgeweblight-10-27 due to stuck exec daemon not releasing port 6445
  • 16:35 dcaro: rebooting root@tools-sgeweblight-10-25 due to stuck exec daemon not releasing port 6445
  • 16:32 dcaro: rebooting root@tools-sgeweblight-10-26 due to stuck exec daemon not releasing port 6445
  • 16:26 dcaro: rebooting root@tools-sgeexec-10-14 due to stuck exec daemon not releasing port 6445

2023-04-17

  • 13:10 dcaro: rebooting tools-sgegrid-master node (T334847)
  • 02:43 legoktm: manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)

2023-04-11

2023-04-10

  • 10:46 taavi: patch existing PSP roles to use policy/v1beta1 T331619
  • 09:16 arturo: upgrading k8s cluster to 1.22 (T286856)

2023-04-07

  • 14:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-3 (T333929) - cookbook ran by taavi@runko
  • 14:30 wm-bot2: removed instance tools-k8s-control-2 - cookbook ran by taavi@runko

2023-04-05

  • 15:16 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (5ea5992) - cookbook ran by taavi@runko
  • 15:10 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:3569803 from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (3569803) - cookbook ran by taavi@runko
  • 14:56 wm-bot2: Added a new k8s worker tools-k8s-worker-88.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 14:42 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 14:42 wm-bot2: Added a new k8s worker tools-k8s-worker-87.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 14:28 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 14:28 wm-bot2: Added a new k8s worker tools-k8s-worker-86.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 14:15 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 14:15 wm-bot2: Added a new k8s worker tools-k8s-worker-85.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 14:01 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 14:01 wm-bot2: Added a new k8s worker tools-k8s-worker-84.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 13:47 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 13:47 wm-bot2: Added a new k8s worker tools-k8s-worker-83.tools.eqiad1.wikimedia.cloud to the cluster (T333972) - cookbook ran by taavi@runko
  • 13:34 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 13:33 wm-bot2: removed instance tools-k8s-worker-83 - cookbook ran by taavi@runko
  • 13:15 wm-bot2: Adding a new k8s worker node (T333972) - cookbook ran by taavi@runko
  • 13:06 wm-bot2: removing grid node tools-sgeweblight-10-31.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 13:02 wm-bot2: removing grid node tools-sgeweblight-10-29.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 13:00 wm-bot2: removing grid node tools-sgeexec-10-9.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 12:58 wm-bot2: removing grid node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 12:54 wm-bot2: removing grid node tools-sgeexec-10-7.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 12:52 wm-bot2: removing grid node tools-sgeweblight-10-13.tools.eqiad1.wikimedia.cloud (T333972) - cookbook ran by taavi@runko
  • 12:34 wm-bot2: drained, depooled and removed k8s control node tools-k8s-control-1 - cookbook ran by taavi@runko
  • 12:07 wm-bot2: Added a new k8s control tools-k8s-control-6.tools.eqiad1.wikimedia.cloud to the cluster - cookbook ran by taavi@runko
  • 11:53 wm-bot2: Adding a new k8s control node - cookbook ran by taavi@runko
  • 11:51 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
  • 11:39 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 11:38 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
  • 11:21 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 11:21 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
  • 11:09 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 10:53 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
  • 10:41 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 10:41 wm-bot2: removed instance tools-k8s-control-6 - cookbook ran by taavi@runko
  • 10:16 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko

2023-04-04

  • 19:00 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 18:59 wm-bot2: removed instance tools-k8s-control-5 - cookbook ran by taavi@runko
  • 18:46 wm-bot2: Adding a new k8s control node (T333929) - cookbook ran by taavi@runko
  • 18:45 wm-bot2: Adding a new k8s CONTROL node (T333929) - cookbook ran by taavi@runko
  • 10:15 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo
  • 09:28 arturo: hard-reboot the 3 k8s control nodes

2023-04-03

  • 17:13 wm-bot2: rebooted k8s node tools-k8s-worker-31 - cookbook ran by taavi@runko
  • 17:11 wm-bot2: rebooted k8s node tools-k8s-worker-32 - cookbook ran by taavi@runko
  • 17:09 wm-bot2: rebooted k8s node tools-k8s-worker-33 - cookbook ran by taavi@runko
  • 17:07 wm-bot2: rebooted k8s node tools-k8s-worker-34 - cookbook ran by taavi@runko
  • 17:05 wm-bot2: rebooted k8s node tools-k8s-worker-35 - cookbook ran by taavi@runko
  • 17:04 wm-bot2: rebooted k8s node tools-k8s-worker-36 - cookbook ran by taavi@runko
  • 17:02 wm-bot2: rebooted k8s node tools-k8s-worker-37 - cookbook ran by taavi@runko
  • 17:00 wm-bot2: rebooted k8s node tools-k8s-worker-38 - cookbook ran by taavi@runko
  • 16:58 wm-bot2: rebooted k8s node tools-k8s-worker-39 - cookbook ran by taavi@runko
  • 16:56 wm-bot2: rebooted k8s node tools-k8s-worker-40 - cookbook ran by taavi@runko
  • 16:55 wm-bot2: rebooted k8s node tools-k8s-worker-41 - cookbook ran by taavi@runko
  • 16:53 wm-bot2: rebooted k8s node tools-k8s-worker-42 - cookbook ran by taavi@runko
  • 16:51 wm-bot2: rebooted k8s node tools-k8s-worker-43 - cookbook ran by taavi@runko
  • 16:49 wm-bot2: rebooted k8s node tools-k8s-worker-44 - cookbook ran by taavi@runko
  • 16:45 wm-bot2: rebooted k8s node tools-k8s-worker-45 - cookbook ran by taavi@runko
  • 16:43 wm-bot2: rebooted k8s node tools-k8s-worker-46 - cookbook ran by taavi@runko
  • 16:41 wm-bot2: rebooted k8s node tools-k8s-worker-47 - cookbook ran by taavi@runko
  • 16:40 wm-bot2: rebooted k8s node tools-k8s-worker-48 - cookbook ran by taavi@runko
  • 16:38 wm-bot2: rebooted k8s node tools-k8s-worker-49 - cookbook ran by taavi@runko
  • 16:36 wm-bot2: rebooted k8s node tools-k8s-worker-50 - cookbook ran by taavi@runko
  • 16:35 wm-bot2: rebooted k8s node tools-k8s-worker-51 - cookbook ran by taavi@runko
  • 16:33 wm-bot2: rebooted k8s node tools-k8s-worker-52 - cookbook ran by taavi@runko
  • 16:31 wm-bot2: rebooted k8s node tools-k8s-worker-53 - cookbook ran by taavi@runko
  • 16:28 wm-bot2: rebooted k8s node tools-k8s-worker-54 - cookbook ran by taavi@runko
  • 16:27 wm-bot2: rebooted k8s node tools-k8s-worker-55 - cookbook ran by taavi@runko
  • 16:25 wm-bot2: rebooted k8s node tools-k8s-worker-56 - cookbook ran by taavi@runko
  • 16:23 wm-bot2: rebooted k8s node tools-k8s-worker-57 - cookbook ran by taavi@runko
  • 16:21 wm-bot2: rebooted k8s node tools-k8s-worker-58 - cookbook ran by taavi@runko
  • 16:20 wm-bot2: rebooted k8s node tools-k8s-worker-59 - cookbook ran by taavi@runko
  • 16:18 wm-bot2: rebooted k8s node tools-k8s-worker-60 - cookbook ran by taavi@runko
  • 16:09 wm-bot2: rebooted k8s node tools-k8s-worker-61 - cookbook ran by taavi@runko
  • 16:07 wm-bot2: rebooted k8s node tools-k8s-worker-62 - cookbook ran by taavi@runko
  • 16:01 wm-bot2: rebooted k8s node tools-k8s-worker-64 - cookbook ran by taavi@runko
  • 16:00 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
  • 15:58 wm-bot2: rebooted k8s node tools-k8s-worker-65 - cookbook ran by taavi@runko
  • 15:56 wm-bot2: rebooted k8s node tools-k8s-worker-66 - cookbook ran by taavi@runko
  • 15:48 wm-bot2: rebooted k8s node tools-k8s-worker-67 - cookbook ran by taavi@runko
  • 15:38 wm-bot2: rebooted k8s node tools-k8s-worker-68 - cookbook ran by taavi@runko
  • 15:36 wm-bot2: rebooted k8s node tools-k8s-worker-69 - cookbook ran by taavi@runko
  • 15:34 wm-bot2: rebooted k8s node tools-k8s-worker-70 - cookbook ran by taavi@runko
  • 15:32 wm-bot2: rebooted k8s node tools-k8s-worker-71 - cookbook ran by taavi@runko
  • 15:30 wm-bot2: rebooted k8s node tools-k8s-worker-72 - cookbook ran by taavi@runko
  • 15:28 wm-bot2: rebooted k8s node tools-k8s-worker-73 - cookbook ran by taavi@runko
  • 15:26 wm-bot2: rebooted k8s node tools-k8s-worker-74 - cookbook ran by taavi@runko
  • 15:24 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
  • 15:22 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
  • 15:17 wm-bot2: rebooted k8s node tools-k8s-worker-75 - cookbook ran by taavi@runko
  • 15:14 wm-bot2: rebooted k8s node tools-k8s-worker-76 - cookbook ran by taavi@runko
  • 15:12 wm-bot2: rebooted k8s node tools-k8s-worker-77 - cookbook ran by taavi@runko
  • 15:10 wm-bot2: rebooted k8s node tools-k8s-worker-78 - cookbook ran by taavi@runko
  • 15:08 wm-bot2: rebooted k8s node tools-k8s-worker-79 - cookbook ran by taavi@runko
  • 15:06 wm-bot2: rebooted k8s node tools-k8s-worker-80 - cookbook ran by taavi@runko
  • 14:59 wm-bot2: rebooted k8s node tools-k8s-worker-81 - cookbook ran by taavi@runko
  • 14:41 wm-bot2: rebooted k8s node tools-k8s-worker-82 - cookbook ran by taavi@runko
  • 14:38 wm-bot2: rebooting the whole tools k8s cluster (58 nodes) - cookbook ran by taavi@runko
  • 14:13 andrewbogott: test log to see if stashbot is back working
  • 13:19 andrewbogott: forcing puppet run on all toolforge VMs
  • 08:28 taavi: stop exim4.service on tools-sgecron-2 T333477
  • 06:52 taavi: stop jobs-framework-emailer to prevent spam due to NFS being read-only T333477

2023-03-29

2023-03-28

2023-03-27

2023-03-26

  • 20:28 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

2023-03-24

  • 14:13 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance

2023-03-21

  • 08:11 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

2023-03-20

  • 13:39 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
  • 10:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@endurance

2023-03-19

  • 09:32 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko

2023-03-17

  • 15:56 andrewbogott: truncating .out, .err, and .log files to 10MB in anticipation of moving the NFS volumes

2023-03-13

2023-03-12

  • 13:40 taavi: restart haproxy on tools-k8s-haproxy-3

2023-03-11

  • 18:38 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 18:36 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 18:34 wm-bot2: removing grid node tools-sgeexec-10-11.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 18:31 taavi: reboot misbehaving tools-sgeexec-10-11

2023-03-10

2023-03-09

2023-03-08

  • 22:31 bd808: Live hacked user-maintainer clusterrole to work around breakage in T331572

2023-03-07

  • 11:34 wm-bot2: Increased quotas by 2 volumes - cookbook ran by fran@wmf3169
  • 11:09 wm-bot2: Increased quotas by 6 snapshots - cookbook ran by fran@wmf3169
  • 11:07 wm-bot2: Increased quotas by 4000 gigabytes - cookbook ran by fran@wmf3169

2023-03-06

2023-03-05

2023-03-02

2023-03-01

2023-02-28

2023-02-23

2023-02-21

  • 09:37 arturo: hard-reboot tools-sgeexec-10-11 (unresponsive to ssh)

2023-02-20

2023-02-19

  • 09:16 taavi: uncordon tools-k8s-worker-[80-82] after fixing security groups T329378

2023-02-17

2023-02-16

2023-02-15

2023-02-14

  • 15:07 taavi: import cert-manager components to local docker registry T329453
  • 12:12 arturo: the fixed webservicemonitor is starting a bunch of grid webservices (T329611)
  • 12:10 arturo: included tools-manifests 0.25 in tools-buster aptly repo, deploying it now! (T329611, T329467, T244809)

2023-02-13

2023-02-10

  • 15:45 taavi: reboot tools-k8s-worker-82 to troubleshoot network issues
  • 12:44 wm-bot2: Added a new k8s worker tools-k8s-worker-82.tools.eqiad1.wikimedia.cloud to the worker pool (T329357) - cookbook ran by taavi@runko
  • 12:31 wm-bot2: Adding a new k8s worker node (T329357) - cookbook ran by taavi@runko
  • 12:29 wm-bot2: Added a new k8s worker tools-k8s-worker-81.tools.eqiad1.wikimedia.cloud to the worker pool (T329357) - cookbook ran by taavi@runko
  • 12:15 wm-bot2: Adding a new k8s worker node (T329357) - cookbook ran by taavi@runko
  • 11:53 wm-bot2: Adding a new k8s worker node (T329357) - cookbook ran by taavi@runko
  • 11:44 wm-bot2: removing grid node tools-sgeweblight-10-23.tools.eqiad1.wikimedia.cloud (T329357) - cookbook ran by taavi@runko
  • 11:42 wm-bot2: removing grid node tools-sgeexec-10-5.tools.eqiad1.wikimedia.cloud (T329357) - cookbook ran by taavi@runko
  • 11:39 wm-bot2: removing grid node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud (T329357) - cookbook ran by taavi@runko
  • 11:26 wm-bot2: removing grid node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud (T329357) - cookbook ran by taavi@runko
  • 11:24 wm-bot2: removing grid node tools-sgeexec-10-1.tools.eqiad1.wikimedia.cloud (T329357) - cookbook ran by taavi@runko

2023-02-01

2023-01-26

2023-01-24

  • 12:04 taavi: deploying toolforge-jobs-framework-cli v10 T327775
  • 10:07 taavi: publish toolforge-jobs-framework-cli v9

2023-01-23

2023-01-20

  • 23:24 andrewbogott: truncating logfiles with find . -name '*.err' -size +1G -exec truncate --size=100M {} \;
  • 21:24 andrewbogott: truncating logfiles with find . -name '*.out' -size +1G -exec truncate --size=100M {} \;
  • 01:06 andrewbogott: truncating logfiles with find . -name '*.log' -size +1G -exec truncate --size=100M {} \;
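
The three truncation entries above use the same `find ... -exec truncate` pattern with a different file suffix each time. As a rough illustration only (not the tooling that was actually run), the same idea can be sketched in Python. The function name `truncate_large_files` and its parameters are hypothetical. The sketch assumes that `-size +1G` means "strictly larger than 1 GiB" and that `--size=100M` means 100 MiB, and like `truncate` it keeps the start of each file and discards the rest.

```python
import os
from pathlib import Path

GIB = 1024 ** 3  # assumed meaning of "1G" in `find -size +1G`
MIB = 1024 ** 2  # assumed meaning of "M" in `truncate --size=100M`


def truncate_large_files(root, suffix, threshold_bytes=GIB,
                         keep_bytes=100 * MIB, dry_run=False):
    """Cut files under `root` whose name ends in `suffix` down to `keep_bytes`.

    Only files strictly larger than `threshold_bytes` are touched.
    Returns the sorted list of paths that were (or, with dry_run=True,
    would have been) truncated. This is a hypothetical helper, not an
    existing Toolforge tool.
    """
    touched = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        # Skip symlinks, directories and other non-regular entries.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_size > threshold_bytes:
            if not dry_run:
                # Keeps the first keep_bytes bytes, like truncate(1).
                os.truncate(path, keep_bytes)
            touched.append(path)
    return touched
```

Calling `truncate_large_files(".", ".err", dry_run=True)` first lists what would be affected without changing anything, which is a cheap sanity check before running it for real on a shared NFS volume.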

2023-01-19

  • 11:46 arturo: `aborrero@tools-k8s-control-1:~$ sudo -i kubectl delete clusterrolebinding jobs-api-psp` (cleanup unused stuff)

2023-01-18

2023-01-17

2023-01-10

2023-01-03

  • 17:17 andrewbogott: find -name '*.log' -size +1G -exec truncate --size=1G {} \;

2022-12-20

  • 09:07 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

2022-12-12

  • 14:36 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-12-09

  • 07:20 taavi: change the canonical tools-mail external hostname to use mail.tools.wmcloud.org and add valid spf to toolforge.org T324809

2022-12-05

  • 11:06 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-11-30

2022-11-29

  • 19:52 taavi: clear puppet failure emails from exim queues

2022-11-09

  • 08:58 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by arturo@nostromo

2022-11-05

  • 19:28 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.err' -size +1G -exec truncate --size=1G {} \;
  • 13:26 andrewbogott: cleaning up nfs share with root@labstore1004:/srv/tools/shared/tools# find -name '*.log' -size +1G -exec truncate --size=1G {} \;

2022-11-04

2022-11-01

  • 09:37 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master (T322110) - cookbook ran by dcaro@vulcanus

2022-10-26

  • 08:45 dcaro: depooling and rebooting tools-sgeexec-10-22 to get nfs scratch working again

2022-10-25

  • 16:14 wm-bot2: Increased quotas by 5120 gigabytes - cookbook ran by fran@wmf3169
  • 15:26 dcaro: pushed a newer docker-registry.tools.wmflabs.org/python:3.9-slim-bullseye (from upstream python:3.9-slim-bullseye)

2022-10-20

  • 16:54 andrewbogott: rebooting tools-package-builder-04
  • 16:49 andrewbogott: rebooting redis nodes (one at a time)
  • 10:54 taavi: rebuild mono68-sssd image with the expired DST Root CA X3 removed T311466

2022-10-18

2022-10-17

  • 07:25 taavi: push updated perl532 images T320824

2022-10-14

2022-10-13

  • 15:10 arturo: restart jobs-emailer pod

2022-10-12

  • 23:25 bd808: Rebuilding all Toolforge docker images (T278436, T311466, T293552)
  • 20:43 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. Third try seems to be working. (T316554)
  • 20:31 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages after fixing bug in building the bullseye base image. (T316554)
  • 16:26 dcaro: deploy the latest registry admission webhook, now for real (image tag 07bc7db)
  • 12:48 dcaro: deploy the latest registry admission webhook (image tag 07bc7db)
  • 09:26 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
  • 09:19 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-10-11

2022-10-10

2022-10-09

  • 17:29 taavi: kill 10 idle tmux sessions of user 'hoi' on tools-sgebastion-10 T320352

2022-10-07

  • 13:02 taavi: taavi@cloudcontrol1005 ~ $ sudo mark_tool --disable oncall # T320240

2022-10-06

  • 00:39 bd808: Image rebuild failing with debian apt repo signature issue. Will investigate tomorrow. (T316554)
  • 00:36 bd808: Rebuilding all Toolforge docker images to pick up bug and security fix packages. (T316554)
  • 00:04 bd808: Building new php74-sssd-base & web images (T310435)

2022-10-03

2022-09-28

  • 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # T318858
  • 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, T318858
  • 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

2022-09-22

  • 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group T317438
  • 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy T317438

2022-09-10

  • 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

2022-09-07

  • 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks (T316854)

2022-09-06

  • 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder (T316854)

2022-08-25

  • 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner T293552

2022-08-24

2022-08-20

  • 07:44 dcaro_away: all k8s nodes ready now \o/ (T315718)
  • 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up (T315718)
  • 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking (T315718)

2022-08-18

  • 14:45 andrewbogott: adding lucaswerkmeister as projectadmin (T314527)
  • 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

2022-08-17

  • 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # T315459
  • 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

2022-08-16

  • 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05

2022-08-11

  • 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
  • 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

2022-08-05

  • 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-08-03

2022-07-20

  • 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
  • 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-07-19

  • 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest

2022-07-17

  • 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-07-14

  • 13:48 taavi: rebooting tools-sgeexec-10-2

2022-07-13

  • 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-07-11

  • 16:06 wm-bot2: Increased quotas by {self.increases} (T312692) - cookbook ran by nskaggs@x1carbon

2022-07-07

  • 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-06-28

  • 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master (T311538) - cookbook ran by dcaro@vulcanus
  • 15:51 taavi: add 4096G cinder quota T311509

2022-06-27

  • 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
  • 18:02 taavi: switchover active cron server to tools-sgecron-2 T284767
  • 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:15 taavi: T311412 updating ca used by k8s-apiserver->etcd communication, breakage may happen
  • 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 T311412
  • 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it T311412

2022-06-23

  • 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 13:59 taavi: removing remaining continuous jobs from the stretch grid T277653

2022-06-22

  • 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-06-21

  • 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-06-03

  • 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor T309821
  • 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online T309821
  • 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
  • 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor T309821
  • 15:50 balloons: temp add 1.0G swap to sgeweblight hosts T309821
  • 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor T309821
  • 15:49 balloons: temp add 1.0G swap to sgeweblight hosts T309821
  • 13:25 bd808: Upgrading fleet to tools-webservice 0.86 (T309821)
  • 13:20 bd808: publish tools-webservice 0.86 (T309821)
  • 12:46 taavi: start webservicemonitor on tools-sgecron-01 T309821
  • 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
  • 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid T309821
  • 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
  • 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package (T309821)
  • 03:10 bd808: publish tools-webservice 0.85 with hack for T309821

2022-06-02

  • 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
  • 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
  • 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
  • 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
  • 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
  • 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 12:03 dcaro: refresh prometheus certs (T308402)
  • 11:47 dcaro: refresh registry-admission-controller certs (T308402)
  • 11:42 dcaro: refresh ingress-admission-controller certs (T308402)
  • 11:36 dcaro: refresh volume-admission-controller certs (T308402)
  • 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster T277653
  • 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
  • 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-06-01

  • 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

2022-05-31

  • 16:51 taavi: delete tools-sgeexec-0904 for T309525 experimentation

2022-05-30

  • 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) T277653

2022-05-26

2022-05-22

  • 17:04 taavi: failover tools-redis to the updated cluster T278541
  • 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud (T308982) - cookbook ran by taavi@runko

2022-05-16

2022-05-14

  • 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

2022-05-12

2022-05-10

  • 15:18 taavi: depool tools-k8s-worker-42 for experiments
  • 13:54 taavi: enable distro-wikimedia unattended upgrades T290494

2022-05-06

  • 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl (T307812)

2022-05-05

  • 17:28 taavi: deploy tools-webservice 0.83 T307693

2022-05-03

  • 08:20 taavi: redis: start replication from the old cluster to the new one (T278541)

2022-05-02

  • 08:54 taavi: restart acme-chief.service T307333

2022-04-25

  • 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 (T214343)
  • 14:46 bd808: Building toolforge-webservice v0.82

2022-04-23

  • 16:51 bd808: Built new perl532-sssd/{base,web} images and pushed to registry (T214343)

2022-04-20

2022-04-16

2022-04-12

  • 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' (T305986)
  • 21:27 bd808: Added komla to 'roots' sudoers policy (T305986)
  • 21:24 bd808: Add komla as projectadmin (T305986)

2022-04-10

  • 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)

2022-04-09

  • 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

2022-04-08

  • 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

2022-04-05

2022-04-04

2022-03-28

  • 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo

2022-03-15

2022-03-14

  • 11:44 arturo: deploy jobs-framework-emailer 9470a5f (T286135)
  • 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo (T297090)

2022-03-10

  • 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

2022-03-01

  • 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state (T302702)
  • 12:11 dcaro: Cleared error state queues for sgeexec-0916 (T302702)
  • 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

2022-02-28

  • 08:02 taavi: reboot sgeexec-0916
  • 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

2022-02-17

  • 08:23 taavi: deleted tools-clushmaster-02
  • 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

2022-02-16

  • 00:12 bd808: Image builds completed.

2022-02-15

  • 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
  • 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
  • 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
  • 22:50 bd808: Built new toollabs-webservice 0.81
  • 18:43 bd808: Enabled puppet on tools-proxy-05
  • 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
  • 18:21 taavi: delete tools-package-builder-03
  • 11:49 arturo: invalidate sssd cache in all bastions to debug T301736
  • 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for T301736
  • 11:15 arturo: reboot tools-sgebastion-10 for T301736

2022-02-10

  • 15:07 taavi: shutdown tools-clushmaster-02 T298191
  • 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally T214427
  • 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - T214427
  • 08:06 taavi: disable puppet globally for enabling puppetdb T214427

2022-02-09

  • 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet T214427
  • 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 18:25 arturo: ignore last message
  • 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 T298191

2022-02-07

  • 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository (T288406)
  • 12:52 taavi: updated maintain-kubeusers for T301081

2022-02-04

  • 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
  • 21:36 taavi: clear error state from some webgrid nodes

2022-02-03

  • 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
  • 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

2022-01-30

  • 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover T278541
  • 14:22 taavi: creating a cluster of 3 bullseye redis hosts for T278541

2022-01-26

  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
  • 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
  • 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
  • 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes (T277653)

2022-01-25

  • 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 11:44 arturo: rebooting buster exec nodes
  • 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

2022-01-24

  • 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 15:23 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-20

  • 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
  • 12:56 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-19

  • 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

2022-01-14

  • 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, T299243

2022-01-12

  • 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
  • 11:03 arturo: created puppet prefix 'tools-sgeweblig'
  • 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

2022-01-04

  • 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
  • 08:12 taavi: disable puppet & exim4 on T298501

Archives