
Nova Resource:Tools/SAL: Difference between revisions

From Wikitech-static
imported>Stashbot
(bstorm_: enabled encryption at rest on the new k8s cluster)
imported>Stashbot
(wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko)
(357 intermediate revisions by 2 users not shown)
=== 2019-12-17 ===
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster

=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues
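The recurring "cleaned up grid queue errors" entries refer to clearing gridengine queue error states on the grid master. A minimal sketch of the equivalent manual steps, assuming a grid admin shell on the master (the queue instance name is illustrative; the actual cookbook may differ):

```shell
# Show full queue status; -explain E prints the reason for queues in error state
qstat -f -explain E

# Clear the error state on one specific queue instance...
qmod -cq 'continuous@tools-sgeexec-0939.tools.eqiad.wmflabs'

# ...or clear all queue error states at once
qmod -cq '*'
```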


=== 2019-12-16 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster

=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2019-12-14 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).

=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station


=== 2019-12-13 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream

=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2019-12-12 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01, as it hadn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues

=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest
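The NFS-staleness reboots above are typically preceded by a check of whether a mount is actually stale. A hedged sketch of such a check (the mount path is illustrative, not the exact one used here):

```shell
# A stale NFS handle makes stat() hang or fail with ESTALE,
# so bound the call with a timeout before deciding to reboot.
if ! timeout 10 stat -t /data/project >/dev/null 2>&1; then
    echo "NFS mount looks stale on $(hostname); schedule a reboot"
fi
```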


=== 2019-12-11 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031

=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2019-12-10 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])

=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2


=== 2019-12-09 ===
* 11:06 andrewbogott: deleting unused security groups:  catgraph, devpi, MTA, mysql, syslog, test    [[phab:T91619|T91619]]

=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2019-12-04 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use

=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon


=== 2019-11-29 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)

=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2019-11-26 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config

=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]


=== 2019-11-25 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])

=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]


=== 2019-11-22 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 05:55 jeh: add Riley Huntley `riley` to base tools project

=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]


=== 2019-11-21 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])

=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2019-11-19 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])

=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2019-11-15 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]

=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]


=== 2019-11-13 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])

=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2019-11-12 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]

=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]


=== 2019-11-10 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]

=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation


=== 2019-11-08 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"

=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2019-11-07 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]

=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko


=== 2019-11-06 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 13:50 arturo: created 3 VMs `tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])

=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2019-11-05 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]

=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko


=== 2019-11-04 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])

=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940


=== 2019-11-01 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]

=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko


=== 2019-10-31 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001.  Runaway logfiles filled up the drive which prevented puppet from running.  If puppet had run, it would have prevented the runaway logfiles.
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])

=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2019-10-30 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]

=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])


=== 2019-10-29 ===
* 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]

=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]


=== 2019-10-28 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 15:54 arturo: tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed [[phab:T235627|T235627]]
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 15:16 arturo: tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy [[phab:T235627|T235627]]
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])

=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])
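The "upload image ... to docker registry" entries above follow the usual pull/tag/push pattern against the project registry. A hedged sketch for one of the logged images (the upstream registry is an assumption; the target registry name appears elsewhere in this log):

```shell
# Mirror an upstream image into the Toolforge registry:
# pull it, re-tag it under the local registry name, then push.
docker pull k8s.gcr.io/kube-apiserver:v1.15.1
docker tag  k8s.gcr.io/kube-apiserver:v1.15.1 docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1
docker push docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1
```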


=== 2019-10-24 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge

=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2019-10-23 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting

=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82


=== 2019-10-22 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone

=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])


=== 2019-10-21 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46

=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2019-10-18 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])

=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko


=== 2019-10-16 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
* 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)

=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])


=== 2019-10-15 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding

=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)


=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])

=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /


=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]

=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component


=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.

=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7


=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting

=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2019-10-08 ===
=== 2022-03-28 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.


=== 2019-10-07 ===
=== 2022-03-15 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08


=== 2019-10-04 ===
=== 2022-03-14 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo ([[phab:T297090|T297090]])
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated


=== 2019-10-03 ===
=== 2022-03-10 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2019-09-27 ===
=== 2022-03-01 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeeex-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand


=== 2019-09-25 ===
=== 2022-02-28 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /


=== 2019-09-23 ===
=== 2022-02-17 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 08:23 taavi: deleted tools-clushmaster-02
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access


=== 2019-09-12 ===
=== 2022-02-16 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in used
* 00:12 bd808: Image builds completed.


=== 2019-09-11 ===
=== 2022-02-15 ===
* 13:30 jeh: restart tools-sgeexec-0912
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2019-09-09 ===
=== 2022-02-10 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]


=== 2019-09-06 ===
=== 2022-02-09 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2019-09-05 ===
=== 2022-02-07 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]


=== 2019-09-01 ===
=== 2022-02-04 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes


=== 2019-08-30 ===
=== 2022-02-03 ===
* 16:54 phamhi: restart maintain-kuberusers service in tools-k8s-master-01
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate


=== 2019-08-29 ===
=== 2022-01-30 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use


=== 2019-08-27 ===
=== 2022-01-26 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes ([[phab:T277653|T277653]])


=== 2019-08-26 ===
=== 2022-01-25 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4


=== 2019-08-18 ===
=== 2022-01-24 ===
* 08:11 arturo: restart maintain-kuberusers service in tools-k8s-master-01
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2019-08-17 ===
=== 2022-01-20 ===
* 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2019-08-15 ===
=== 2022-01-19 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces


=== 2019-08-13 ===
=== 2022-01-14 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]
* 13:41 jeh: Set icingia downtime for toolschecker labs showmount [[phab:T229448|T229448]]


=== 2019-08-12 ===
=== 2022-01-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2019-08-08 ===
=== 2022-01-04 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
 
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job {{Gerrit|5190035}}
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled  tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to cleanup sssd config and let nslcd/nscd start freshly
* 12:47 arturo: reboot tools-worker-1002 to cleanup sssd config and let nslcd/nscd start freshly
* 12:42 arturo: reboot tools-worker-1001 to cleanup sssd config and let nslcd/nscd start freshly
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
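A side note on the two `apt-get clean` entries above: the check-then-clean they record can be sketched as below. This is a hedged illustration, not part of the SAL; the 90% figure and the ~4GB reclaimed come from the log entries themselves.

```shell
# Report how full the root filesystem is, as checked before the
# `apt-get clean` runs logged above (usage on / was > 90%).
usage=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
echo "root filesystem usage: ${usage}%"
# When this creeps above ~90%, `sudo apt-get clean` empties the package
# cache in /var/cache/apt/archives; in the logged case that freed ~4GB.
```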
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
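The `truncate -s 0` in the entry above empties a file in place without replacing its inode, which matters for logs like exim's paniclog that a running daemon may hold open. A minimal demonstration on a scratch file (the temp file is a stand-in for /var/log/exim4/paniclog):

```shell
# Create a scratch file, fill it, then empty it the same way the
# paniclog was emptied above.
f=$(mktemp)
printf 'stale panic messages\n' > "$f"
truncate -s 0 "$f"        # file still exists, size is now 0
wc -c < "$f"              # prints 0
rm -f "$f"
```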
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tools-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] so the k8s node moves would register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
* 20:03 bstorm_: depooled  tools-webgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to tool's $HOME/crontab.trusty.save
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01{{!}}07{{!}}10)` all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.com is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg' and 'dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300sec
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd vis systemctl and `id zhuyifei1999` returns correct stuffs
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
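The depool entries above repeat the same `exec-manage depool` command once per node. A dry-run generator for that sequence might look like this (node list copied from the log; `exec-manage` is the grid management script invoked in the entries):

```shell
#!/bin/sh
# Dry run: print the depool commands logged above, one per node.
# Pipe the output to sh on the grid master to actually execute them.
for n in 1426 1424 1427 1428 1413 1401; do
  echo "sudo exec-manage depool tools-webgrid-lighttpd-${n}.tools.eqiad.wmflabs"
done
```

Printing rather than executing makes it easy to review the exact commands before running them on the master.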
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
 
=== 2019-01-14 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
 
=== 2019-01-11 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
 
=== 2019-01-10 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
 
=== 2019-01-09 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
 
=== 2019-01-07 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
 
=== 2019-01-06 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
 
=== 2019-01-05 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
 
=== 2019-01-04 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
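The archive-then-truncate step above can be sketched as follows, shown on a scratch file rather than the real `/data/project/.system/accounting`; the archive filename and contents are illustrative:

```shell
#!/bin/sh
# Archive a copy, then truncate the live file in place so any process
# holding it open keeps a valid file handle.
f=$(mktemp)
echo 'accounting records' > "$f"
gzip -c "$f" > "$f.gz"   # keep the history elsewhere
: > "$f"                 # truncate in place
wc -c < "$f"             # prints 0
rm -f "$f" "$f.gz"
```

Truncating with `: >` (rather than deleting and recreating) matters for files that long-running daemons keep open.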
 
=== 2019-01-03 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01
 
=== 2018-12-21 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]
 
=== 2018-12-20 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002
 
=== 2018-12-17 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
 
=== 2018-12-11 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
 
=== 2018-12-05 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from the cluster ([[phab:T196973|T196973]])
 
=== 2018-12-04 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])
 
=== 2018-12-01 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
 
=== 2018-11-30 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])
 
=== 2018-11-27 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
 
=== 2018-11-26 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])
 
=== 2018-11-20 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 09:43 godog: restart prometheus@tools on prometheus-01
 
=== 2018-11-16 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 17:47 gtirloni: deleted tools-mail instance
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
 
=== 2018-11-14 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009
 
=== 2018-11-13 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
 
=== 2018-11-08 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])
 
=== 2018-11-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
 
=== 2018-11-02 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
 
=== 2018-10-31 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
 
=== 2018-10-29 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
 
=== 2018-10-26 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
 
=== 2018-10-19 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
 
=== 2018-10-18 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
 
=== 2018-10-16 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
 
=== 2018-10-07 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be stuck in an infinite loop, sleeping 10 seconds at a time. Installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
 
=== 2018-09-21 ===
* 12:35 arturo: cleanup stalled apt preference files (pinning) in tools-clushmaster-01
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines
 
=== 2018-09-17 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
 
=== 2018-09-14 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API
 
=== 2018-09-08 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
 
=== 2018-09-07 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
 
=== 2018-08-27 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
 
=== 2018-08-22 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
 
=== 2018-08-19 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
 
=== 2018-08-14 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
 
=== 2018-08-13 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice
 
=== 2018-08-09 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
 
=== 2018-08-08 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
 
=== 2018-08-06 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
 
=== 2018-08-01 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
 
=== 2018-07-30 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
 
=== 2018-07-27 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
 
=== 2018-07-25 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
 
=== 2018-07-18 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]
 
=== 2018-06-30 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
 
=== 2018-06-29 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]
 
=== 2018-06-28 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
 
=== 2018-06-21 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-20 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
 
=== 2018-06-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-11 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
 
=== 2018-06-08 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
 
=== 2018-06-07 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim panilog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
 
=== 2018-06-06 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]])
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
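Scripting those restarts starts with finding the affected tools, which boils down to filtering `kubectl get pods --all-namespaces` output on the STATUS column. A sketch, with a here-doc of sample lines standing in for live cluster output:

```shell
# Print the namespace (i.e. the tool) of every pod in CrashLoopBackOff.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE; sample data is illustrative.
awk '$4 == "CrashLoopBackOff" {print $1}' <<'EOF'
tool-a    web-123    0/1    CrashLoopBackOff    42    3h
tool-b    web-456    1/1    Running             0     3h
EOF
```

Each resulting name could then be fed to a `webservice restart` loop, as the entries above describe.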
 
=== 2018-06-05 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben ([[phab:T196486|T196486]])
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])
 
=== 2018-06-04 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
 
=== 2018-06-03 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' [[phab:T195834|T195834]]
 
=== 2018-05-31 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
 
=== 2018-05-30 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]
 
=== 2018-05-28 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
 
=== 2018-05-25 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
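The edit above changes two default resource requests in the cluster-wide `sge_request` file. Assuming the file holds qsub-style defaults with one `-l` option per line (an assumption about the format; the sample contents below are illustrative), the change amounts to:

```shell
#!/bin/sh
# Reproduce the sge_request default change on a scratch copy
# (real file: /data/project/.system/gridengine/default/common/sge_request).
f=$(mktemp)
printf -- '-l h_vmem=256M\n-l release=precise\n' > "$f"
# Raise the default memory request and retarget the default release.
sed -i -e 's/h_vmem=256M/h_vmem=512M/' -e 's/release=precise/release=trusty/' "$f"
cat "$f"
rm -f "$f"
```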
 
=== 2018-05-22 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
 
=== 2018-05-18 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
 
=== 2018-05-16 ===
* 21:17 zhuyifei1999_: maintain-kubeusers stuck in infinite sleeps of 10 seconds
 
=== 2018-05-15 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
 
=== 2018-05-12 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
 
=== 2018-05-11 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2018-05-10 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
 
=== 2018-05-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
 
=== 2018-05-07 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
 
=== 2018-05-05 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
 
=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
 
=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`
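The `grep -E "    1 "` in the command above matches the PPID column of `ps axwo user:20,ppid,pid,cmd` output, i.e. processes reparented to init. An equivalent, slightly more robust filter in awk (the sample ps lines below are illustrative, not from the actual grid):

```shell
# Print PIDs of php-cgi processes whose parent is init (PPID 1), i.e. orphans.
# Columns: USER PPID PID CMD; the here-doc stands in for live ps output.
awk '$2 == 1 && $4 ~ /php-cgi/ {print $3}' <<'EOF'
tools.jembot            1  4242 /usr/bin/php-cgi
tools.web            3301  4243 /usr/bin/php-cgi
EOF
```

Matching field `$2` directly avoids the fragile fixed-width whitespace match that `grep -E "    1 "` relies on.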
 
=== 2018-04-15 ===
* 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]
 
=== 2018-04-11 ===
* 13:25 chasemp: cleaned up exim frozen messages in an effort to alleviate queue pressure
 
=== 2018-04-06 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppemaster for [[phab:T159254|T159254]]
 
=== 2018-04-05 ===
* 18:46 chicocvenancio: killed wget that was hogging io
 
=== 2018-03-29 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
 
=== 2018-03-28 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
 
=== 2018-03-26 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-23 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
 
=== 2018-03-22 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-21 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])
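The age-based cleanup above hinges on find's `-mtime +2` filter. A demonstration on a scratch directory (the real run additionally restricted to files owned by tools.wsexport and targeted /tmp across the grid):

```shell
#!/bin/sh
# Demonstrate find's -mtime filter: only files older than 2 days are deleted.
d=$(mktemp -d)
touch -d '3 days ago' "$d/old.log"   # GNU touch: backdate the mtime
touch "$d/new.log"
find "$d" -type f -mtime +2 -delete
ls "$d"                              # prints new.log
rm -rf "$d"
```

Adding `-user tools.wsexport` to the `find` expression would reproduce the ownership restriction mentioned in the entry.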
 
=== 2018-03-20 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
 
=== 2018-03-19 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
 
=== 2018-03-16 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
 
=== 2018-03-15 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
 
=== 2018-03-14 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
 
=== 2018-03-12 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages from jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers
 
=== 2018-03-08 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
 
=== 2018-03-07 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
 
=== 2018-03-06 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did in canary servers last week and it went fine. So run in fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
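The worker reboots in this section follow a cordon/drain/reboot/uncordon cycle. A sketch of that ordering as a command plan; the ssh step and the drain flags are assumptions for illustration, not taken from the log (`--delete-local-data` was the flag name on kubectl of this era):

```python
def reboot_plan(node: str) -> list[list[str]]:
    """Ordered commands for safely rebooting a k8s worker:
    cordon (stop new pods), drain (evict running pods), reboot, uncordon."""
    fqdn = f"{node}.tools.eqiad.wmflabs"
    return [
        ["kubectl", "cordon", fqdn],
        ["kubectl", "drain", "--ignore-daemonsets", "--delete-local-data", fqdn],
        ["ssh", fqdn, "sudo", "reboot"],  # hypothetical: reboots were run on-host
        ["kubectl", "uncordon", fqdn],
    ]

plan = reboot_plan("tools-worker-1010")
for cmd in plan:
    print(" ".join(cmd))
```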
 
=== 2018-03-05 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]
 
=== 2018-03-02 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
 
=== 2018-03-01 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
 
=== 2018-02-27 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 12:16 arturo: running `apt-get autoremove` in canary servers
 
=== 2018-02-26 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 10:35 arturo: enable puppet in tools-proxy-01
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
 
=== 2018-02-25 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
 
=== 2018-02-23 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
 
=== 2018-02-22 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
 
=== 2018-02-21 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lightttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...
 
=== 2018-02-20 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:42 arturo: upgrading tools-k8s-etcd-01
 
=== 2018-02-19 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
 
=== 2018-02-16 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
 
=== 2018-02-15 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
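Reading the timestamps above bottom-up, the per-host upgrade runs through the apt channels in a fixed chronological order: trusty-wikimedia, then trusty-updates, then trusty-tools. A minimal sketch of generating that command sequence (the channel order is taken from the entries; the `-y` flag is not used on every run above):

```python
# Channel order observed in the log entries (chronological, oldest first)
CHANNELS = ["trusty-wikimedia", "trusty-updates", "trusty-tools"]

def upgrade_commands(channels: list[str] = CHANNELS) -> list[str]:
    """Build the per-channel apt-upgrade invocations used on these hosts."""
    return [f"sudo apt-upgrade -u upgrade {channel} -y" for channel in channels]

for cmd in upgrade_commands():
    print(cmd)
```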
 
=== 2018-02-14 ===
* 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
 
=== 2018-02-11 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
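The two `find ... -perm -o+w ... chmod o-w` sweeps above can be mirrored in a short script. This is a sketch, not the command actually run: it only handles `-maxdepth 1`, and the `skip_uid` parameter stands in for find's hard-coded `! -uid 0` owner filter:

```python
import os
import stat

def strip_other_write(root: str, skip_uid: int = 0) -> list[str]:
    """Clear the other-write bit on direct children of root that are
    world-writable, mirroring:
      find root -maxdepth 1 -perm -o+w ! -uid 0 -exec chmod -v o-w {} \\;"""
    changed = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        st = os.lstat(path)
        if stat.S_ISLNK(st.st_mode):
            continue  # do not chmod through symlinks
        if st.st_uid != skip_uid and st.st_mode & stat.S_IWOTH:
            os.chmod(path, stat.S_IMODE(st.st_mode) & ~stat.S_IWOTH)
            changed.append(path)
    return changed

# the 2777 -> 2775 transition seen on the tool directories:
assert stat.S_IMODE(0o2777) & ~stat.S_IWOTH == 0o2775
```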
 
=== 2018-02-09 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that were running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
 
=== 2018-02-08 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
 
=== 2018-02-06 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 13:05 arturo: unpublish/publish trusty-tools repo
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to trusty-tools repo (self contained)
 
=== 2018-02-05 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
 
=== 2018-02-03 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
 
=== 2018-01-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
 
=== 2018-01-29 ===
* 20:02 chasemp: add zhuyifei1999_ tools root for  [[phab:T185577|T185577]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent
 
=== 2018-01-28 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03
 
=== 2018-01-27 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
 
=== 2018-01-25 ===
* 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]
 
=== 2018-01-23 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: add me, arturo, chico to sudoers and removed marc
 
=== 2018-01-22 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster
 
=== 2018-01-19 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
 
=== 2018-01-18 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-17 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
* 15:04 andrewbogott: depooling exec-manage tools-exec-1430.  Experimenting with purge-old-kernels
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-16 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail  almouked@ltnet.net 719 pending messages cleared
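The 22:01 `qstat -explain E -xml | grep | sed | xargs qmod -cq` pipeline in this section can also be done with a real XML parser instead of text munging. A sketch; the sample document's element layout is an assumption about gridengine's `qstat -xml` schema, not copied from real output:

```python
import xml.etree.ElementTree as ET

def queues_in_error(qstat_xml: str) -> list[str]:
    """Extract queue instance names from `qstat -explain E -xml` output,
    the XML-aware equivalent of the grep/sed pipeline in the log entry."""
    root = ET.fromstring(qstat_xml)
    return [el.text for el in root.iter("name") if el.text]

# hypothetical qstat -xml fragment for two queues in error state
sample = """<job_info>
  <queue_info>
    <Queue-List><name>task@tools-exec-1436</name><state>E</state></Queue-List>
    <Queue-List><name>continuous@tools-exec-1403</name><state>E</state></Queue-List>
  </queue_info>
</job_info>"""

# the resulting names would then be passed to `qmod -cq` to clear the error
print(queues_in_error(sample))
```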
 
=== 2018-01-11 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install meltdown kernel and reboot workers 1011-1016 as jessie pilot
 
=== 2018-01-10 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
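The cordon loops in these entries exclude nodes with `grep -v -e name1 -e name2`, which matches substrings, so short node names match their FQDNs (and repeating `-e tools-worker-1016` twice, as in the 15:03 entry above, is harmless). The same selection, sketched in Python:

```python
def nodes_to_cordon(all_nodes: list[str], keep: list[str]) -> list[str]:
    """Substring-based exclusion, like:
    kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016"""
    return [n for n in all_nodes if not any(k in n for k in keep)]

nodes = [
    "tools-worker-1001.tools.eqiad.wmflabs",
    "tools-worker-1010.tools.eqiad.wmflabs",
    "tools-worker-1016.tools.eqiad.wmflabs",
]
keep = ["tools-worker-1001", "tools-worker-1016"]
print(nodes_to_cordon(nodes, keep))  # only tools-worker-1010 gets cordoned
```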
 
=== 2018-01-09 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
 
=== 2018-01-08 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
 
=== 2018-01-06 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
 
=== 2018-01-05 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
 
=== 2018-01-04 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
 
=== 2018-01-03 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)
</noinclude>
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>


=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station

=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest

=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2

=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by {self.increases} ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon
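The literal `{self.increases}` in the 16:06 entry above is a placeholder that was never interpolated: in Python this is the classic symptom of building a message without the `f` string prefix. A sketch of the failure mode; the class and attribute values are hypothetical, since the cookbook's code is not shown here:

```python
class QuotaCookbook:
    """Hypothetical stand-in for the quota-increase cookbook."""

    def __init__(self):
        self.increases = {"cores": 8, "ram": "16G"}  # made-up example values

    def message(self) -> tuple[str, str]:
        broken = "Increased quotas by {self.increases}"   # no f-prefix: braces stay literal
        fixed = f"Increased quotas by {self.increases}"   # f-string interpolates the attribute
        return broken, fixed

broken, fixed = QuotaCookbook().message()
print(broken)  # prints the unformatted placeholder, as seen in the SAL entry
print(fixed)
```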

=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]

=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]

=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]

=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]

=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
  • 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
  • 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-06-01

  • 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

2022-05-31

  • 16:51 taavi: delete tools-sgeexec-0904 for T309525 experimentation

2022-05-30

  • 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) T277653

2022-05-22

  • 17:04 taavi: failover tools-redis to the updated cluster T278541
  • 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud (T308982) - cookbook ran by taavi@runko

2022-05-14

  • 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

2022-05-10

  • 15:18 taavi: depool tools-k8s-worker-42 for experiments
  • 13:54 taavi: enable distro-wikimedia unattended upgrades T290494

2022-05-06

  • 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl (T307812)

2022-05-05

  • 17:28 taavi: deploy tools-webservice 0.83 T307693

2022-05-03

  • 08:20 taavi: redis: start replication from the old cluster to the new one (T278541)

2022-05-02

  • 08:54 taavi: restart acme-chief.service T307333

2022-04-25

  • 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 (T214343)
  • 14:46 bd808: Building toolforge-webservice v0.82

2022-04-23

  • 16:51 bd808: Built new perl532-sssd/{base,web} images and pushed to registry (T214343)

2022-04-12

  • 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' (T305986)
  • 21:27 bd808: Added komla to 'roots' sudoers policy (T305986)
  • 21:24 bd808: Add komla as projectadmin (T305986)

2022-04-10

  • 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)

2022-04-09

  • 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

2022-04-08

  • 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

2022-03-28

  • 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo

2022-03-14

  • 11:44 arturo: deploy jobs-framework-emailer 9470a5f (T286135)
  • 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo (T297090)

2022-03-10

  • 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

2022-03-01

  • 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state (T302702)
  • 12:11 dcaro: Cleared error state queues for sgeexec-0916 (T302702)
  • 10:23 arturo: tools-sgeexec-0913/0916 are depooled (queue errors); rebooting them and cleaning errors by hand

2022-02-28

  • 08:02 taavi: reboot sgeexec-0916
  • 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

2022-02-17

  • 08:23 taavi: deleted tools-clushmaster-02
  • 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

2022-02-16

  • 00:12 bd808: Image builds completed.

2022-02-15

  • 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
  • 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
  • 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
  • 22:50 bd808: Built new toollabs-webservice 0.81
  • 18:43 bd808: Enabled puppet on tools-proxy-05
  • 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
  • 18:21 taavi: delete tools-package-builder-03
  • 11:49 arturo: invalidate sssd cache in all bastions to debug T301736
  • 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for T301736
  • 11:15 arturo: reboot tools-sgebastion-10 for T301736

2022-02-10

  • 15:07 taavi: shutdown tools-clushmaster-02 T298191
  • 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally T214427
  • 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - T214427
  • 08:06 taavi: disable puppet globally for enabling puppetdb T214427

2022-02-09

  • 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet T214427
  • 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 18:25 arturo: ignore last message
  • 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 T298191

2022-02-07

  • 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository (T288406)
  • 12:52 taavi: updated maintain-kubeusers for T301081

2022-02-04

  • 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
  • 21:36 taavi: clear error state from some webgrid nodes

2022-02-03

  • 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
  • 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

2022-01-30

  • 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover T278541
  • 14:22 taavi: creating a cluster of 3 bullseye redis hosts for T278541

2022-01-26

  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
  • 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
  • 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
  • 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes (T277653)

2022-01-25

  • 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 11:44 arturo: rebooting buster exec nodes
  • 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

2022-01-24

  • 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 15:23 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-20

  • 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
  • 12:56 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-19

  • 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

2022-01-14

  • 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, T299243

2022-01-12

  • 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
  • 11:03 arturo: created puppet prefix 'tools-sgeweblig'
  • 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

2022-01-04

  • 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
  • 08:12 taavi: disable puppet & exim4 on T298501

Archives