You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Nova Resource:Tools/SAL: Difference between revisions
imported>Stashbot (majavah: deploying volume-admission to tools, should not affect anything yet T279106) |
imported>Stashbot (wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko) |
(81 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
=== | === 2022-06-23 === | ||
* | * 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | ||
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]] | |||
=== | === 2022-06-22 === | ||
* | * 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | ||
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
=== | === 2022-06-21 === | ||
* | * 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | ||
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko | |||
=== | === 2022-06-03 === | ||
* | * 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | ||
* | * 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]] | ||
* | * 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | ||
* | * 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | ||
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | |||
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]] | |||
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | |||
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | |||
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | |||
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space | |||
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster | |||
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]] | |||
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]] | |||
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821 | |||
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821 | |||
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]]) | |||
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]]) | |||
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]] | |||
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too | |||
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]] | |||
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host | |||
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]]) | |||
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]] | |||
=== | === 2022-06-02 === | ||
* | * 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler | ||
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting | |||
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key | |||
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key | |||
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko | |||
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko | |||
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko | |||
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]]) | |||
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]]) | |||
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]]) | |||
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]]) | |||
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko | |||
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]] | |||
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]] | |||
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko | |||
=== | === 2022-06-01 === | ||
* | * 11:18 taavi: depool and remove tools-sgeexec-09[07-14] | ||
=== | === 2022-05-31 === | ||
* 16: | * 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation | ||
=== | === 2022-05-30 === | ||
* | * 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]] | ||
=== | === 2022-05-26 === | ||
* | * 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko | ||
=== | === 2022-05-22 === | ||
* | * 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]] | ||
* | * 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko | ||
=== | === 2022-05-16 === | ||
* | * 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko | ||
=== | === 2022-05-14 === | ||
* | * 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940 | ||
=== | === 2022-05-12 === | ||
* | * 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]] | ||
* | * 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]] | ||
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko | |||
=== | === 2022-05-10 === | ||
* | * 15:18 taavi: depool tools-k8s-worker-42 for experiments | ||
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]] | |||
=== | === 2022-05-06 === | ||
* | * 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]]) | ||
=== | === 2022-05-05 === | ||
* | * 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]] | ||
=== | === 2022-05-03 === | ||
* | * 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]]) | ||
=== | === 2022-05-02 === | ||
* | * 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]] | ||
=== | === 2022-04-25 === | ||
* | * 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]]) | ||
* | * 14:46 bd808: Building toolforge-webservice v0.82 | ||
=== | === 2022-04-23 === | ||
* | * 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]]) | ||
=== | === 2022-04-20 === | ||
* | * 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?) | ||
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko | |||
=== | === 2022-04-16 === | ||
* | * 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko | ||
=== | === 2022-04-12 === | ||
* | * 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]]) | ||
* | * 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]]) | ||
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]]) | |||
=== | === 2022-04-10 === | ||
* | * 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space) | ||
=== | === 2022-04-09 === | ||
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on / | |||
=== | === 2022-04-08 === | ||
* | * 10:44 arturo: disabled debug mode on the k8s jobs-emailer component | ||
=== | === 2022-04-05 === | ||
* | * 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo | ||
* | * 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo | ||
* | * 07:21 arturo: deploying toolforge-jobs-framework-cli v7 | ||
=== | === 2022-04-04 === | ||
* | * 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo | ||
* 16: | * 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo | ||
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions | |||
=== | === 2022-03-28 === | ||
* | * 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo | ||
=== | === 2022-03-15 === | ||
* 16: | * 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo | ||
* | * 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...) | ||
=== | === 2022-03-14 === | ||
* | * 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]]) | ||
* | * 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo ([[phab:T297090|T297090]]) | ||
=== | === 2022-03-10 === | ||
* | * 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902 | ||
=== | === 2022-03-01 === | ||
* | * 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]]) | ||
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]]) | |||
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand | |||
=== | === 2022-02-28 === | ||
* | * 08:02 taavi: reboot sgeexec-0916 | ||
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on / | |||
=== | === 2022-02-17 === | ||
* | * 08:23 taavi: deleted tools-clushmaster-02 | ||
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access | |||
=== | === 2022-02-16 === | ||
* | * 00:12 bd808: Image builds completed. | ||
=== | === 2022-02-15 === | ||
* | * 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over. | ||
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01 | |||
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81 | |||
* 22:50 bd808: Built new toollabs-webservice 0.81 | |||
* 18:43 bd808: Enabled puppet on tools-proxy-05 | |||
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes | |||
* 18:21 taavi: delete tools-package-builder-03 | |||
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]] | |||
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]] | |||
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]] | |||
=== | === 2022-02-10 === | ||
* | * 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]] | ||
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo | |||
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]] | |||
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]] | |||
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]] | |||
=== | === 2022-02-09 === | ||
* | * 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]] | ||
* | * 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo | ||
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo | |||
* 18:25 arturo: ignore last message | |||
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo | |||
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]] | |||
=== | === 2022-02-07 === | ||
* 17: | * 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]]) | ||
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]] | |||
=== | === 2022-02-04 === | ||
* | * 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]] | ||
* 21:36 taavi: clear error state from some webgrid nodes | |||
=== | === 2022-02-03 === | ||
* | * 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space | ||
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate | |||
=== | === 2022-01-30 === | ||
* | * 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]] | ||
* | * 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]] | ||
=== | === 2022-01-26 === | ||
* 18:27 | * 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo | ||
* | * 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo | ||
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo | |||
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo | |||
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo | |||
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo | |||
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo | |||
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo | |||
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo | |||
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo | |||
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]]) | |||
=== | === 2022-01-25 === | ||
* | * 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo | ||
* | * 11:44 arturo: rebooting buster exec nodes | ||
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4 | |||
=== | === 2022-01-24 === | ||
* | * 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo | ||
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]]) | |||
=== | === 2022-01-20 === | ||
* | * 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records | ||
* | * 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]]) | ||
=== | === 2022-01-19 === | ||
* | * 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move | ||
=== | === 2022-01-14 === | ||
* | * 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]] | ||
=== | === 2022-01-12 === | ||
* | * 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig` | ||
* 11:03 arturo: created puppet prefix 'tools-sgeweblig' | |||
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig' | |||
=== | === 2022-01-04 === | ||
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart | |||
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]] | |||
==Archives== | ==Archives== | ||
* [[/Archive 1|Archive 1]] (2013-2014) | * [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014) | ||
* [[/Archive 2|Archive 2]] (2015-2017) | * [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017) | ||
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019) | |||
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021) | |||
</noinclude> | </noinclude> | ||
{{SAL|Project Name=tools}} | {{SAL|Project Name=tools}} | ||
<noinclude>[[Category:SAL]]</noinclude> | <noinclude>[[Category:SAL]]</noinclude> |
Revision as of 17:51, 23 June 2022
2022-06-23
- 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 13:59 taavi: removing remaining continuous jobs from the stretch grid T277653
2022-06-22
- 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
2022-06-21
- 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
- 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
2022-06-03
- 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor T309821
- 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online T309821
- 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
- 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
- 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor T309821
- 15:50 balloons: temp add 1.0G swap to sgeweblight hosts T309821
- 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
- 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
- 13:25 bd808: Upgrading fleet to tools-webservice 0.86 (T309821)
- 13:20 bd808: publish tools-webservice 0.86 (T309821)
- 12:46 taavi: start webservicemonitor on tools-sgecron-01 T309821
- 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
- 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid T309821
- 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
- 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package (T309821)
- 03:10 bd808: publish tools-webservice 0.85 with hack for T309821
2022-06-02
- 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
- 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
- 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
- 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
- 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
- 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
- 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
- 12:03 dcaro: refresh prometheus certs (T308402)
- 11:47 dcaro: refresh registry-admission-controller certs (T308402)
- 11:42 dcaro: refresh ingress-admission-controller certs (T308402)
- 11:36 dcaro: refresh volume-admission-controller certs (T308402)
- 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
- 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster T277653
- 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
- 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
2022-06-01
- 11:18 taavi: depool and remove tools-sgeexec-09[07-14]
2022-05-31
- 16:51 taavi: delete tools-sgeexec-0904 for T309525 experimentation
2022-05-30
- 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) T277653
2022-05-26
- 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (e6fa299) (T309146) - cookbook ran by taavi@runko
2022-05-22
- 17:04 taavi: failover tools-redis to the updated cluster T278541
- 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud (T308982) - cookbook ran by taavi@runko
2022-05-16
- 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx (7037eca) - cookbook ran by taavi@runko
2022-05-14
- 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940
2022-05-12
- 12:36 taavi: re-enable CronJobControllerV2 T308205
- 09:28 taavi: deploy jobs-api update T308204
- 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (e6fa299) (T308204) - cookbook ran by taavi@runko
2022-05-10
- 15:18 taavi: depool tools-k8s-worker-42 for experiments
- 13:54 taavi: enable distro-wikimedia unattended upgrades T290494
2022-05-06
- 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl (T307812)
2022-05-05
- 17:28 taavi: deploy tools-webservice 0.83 T307693
2022-05-03
- 08:20 taavi: redis: start replication from the old cluster to the new one (T278541)
2022-05-02
- 08:54 taavi: restart acme-chief.service T307333
2022-04-25
- 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 (T214343)
- 14:46 bd808: Building toolforge-webservice v0.82
2022-04-23
- 16:51 bd808: Built new perl532-sssd/{base,web} images and pushed to registry (T214343)
2022-04-20
- 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
- 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (8f37a04) - cookbook ran by taavi@runko
2022-04-16
- 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics (2c485e9) - cookbook ran by taavi@runko
2022-04-12
- 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' (T305986)
- 21:27 bd808: Added komla to 'roots' sudoers policy (T305986)
- 21:24 bd808: Add komla as projectadmin (T305986)
2022-04-10
- 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)
2022-04-09
- 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /
2022-04-08
- 10:44 arturo: disabled debug mode on the k8s jobs-emailer component
2022-04-05
- 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (d7d3463) - cookbook ran by arturo@nostromo
- 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (d7d3463) - cookbook ran by arturo@nostromo
- 07:21 arturo: deploying toolforge-jobs-framework-cli v7
2022-04-04
- 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (cbcfc47) - cookbook ran by arturo@nostromo
- 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api (cbcfc47) - cookbook ran by arturo@nostromo
- 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions
2022-03-28
- 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo
2022-03-15
- 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer (084ee51) - cookbook ran by arturo@nostromo
- 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)
2022-03-14
- 11:44 arturo: deploy jobs-framework-emailer 9470a5f (T286135)
- 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo (T297090)
2022-03-10
- 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902
2022-03-01
- 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state (T302702)
- 12:11 dcaro: Cleared error state queues for sgeexec-0916 (T302702)
- 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand
2022-02-28
- 08:02 taavi: reboot sgeexec-0916
- 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
2022-02-17
- 08:23 taavi: deleted tools-clushmaster-02
- 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access
2022-02-16
- 00:12 bd808: Image builds completed.
2022-02-15
- 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
- 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
- 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
- 22:50 bd808: Built new toollabs-webservice 0.81
- 18:43 bd808: Enabled puppet on tools-proxy-05
- 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
- 18:21 taavi: delete tools-package-builder-03
- 11:49 arturo: invalidate sssd cache in all bastions to debug T301736
- 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for T301736
- 11:15 arturo: reboot tools-sgebastion-10 for T301736
2022-02-10
- 15:07 taavi: shutdown tools-clushmaster-02 T298191
- 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
- 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally T214427
- 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - T214427
- 08:06 taavi: disable puppet globally for enabling puppetdb T214427
2022-02-09
- 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet T214427
- 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
- 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
- 18:25 arturo: ignore last message
- 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
- 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 T298191
2022-02-07
- 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository (T288406)
- 12:52 taavi: updated maintain-kubeusers for T301081
2022-02-04
- 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
- 21:36 taavi: clear error state from some webgrid nodes
2022-02-03
- 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
- 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate
2022-01-30
- 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover T278541
- 14:22 taavi: creating a cluster of 3 bullseye redis hosts for T278541
2022-01-26
- 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
- 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
- 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
- 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
- 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
- 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
- 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
- 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
- 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
- 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
- 13:55