Nova Resource:Tools/SAL

2021-09-11

  • 08:51 majavah: depool tools-sgeexec-0907
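
The depool above, and the many "cleared error state" entries later in this log, are routine gridengine queue administration. A minimal sketch of the commands typically involved, run with admin rights on the grid master and using the host from this entry; the exact wrapper scripts used, if any, are not recorded here:

    # Take an exec node out of service (depool) and bring it back later (repool):
    sudo qmod -d '*@tools-sgeexec-0907.tools.eqiad.wmflabs'
    sudo qmod -e '*@tools-sgeexec-0907.tools.eqiad.wmflabs'
    # Clear the E (error) state a failed job can leave on a queue instance:
    sudo qmod -cq '*@tools-sgeexec-0907.tools.eqiad.wmflabs'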

2021-09-10

  • 23:26 bstorm: cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
  • 12:00 arturo: shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
  • 09:35 arturo: live-hacking tools puppetmaster with a couple of ops/puppet changes
  • 07:54 arturo: created bullseye VM tools-package-builder-04 (T273942)

2021-09-09

  • 16:20 arturo: 70017ec0ac root@tools-k8s-control-3:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml

2021-09-07

  • 15:27 majavah: rolling out python3-prometheus-client updates
  • 14:41 majavah: manually removing some absented but still present crontabs to stop root@ spam

2021-09-06

  • 16:31 arturo: deploying jobs-framework-cli v4
  • 16:22 arturo: deploying jobs-framework-api 3228d97

2021-09-03

  • 22:36 bstorm: backfilling quotas in screen for T286784
  • 12:49 majavah: deploying new tools-manifest version

2021-09-02

  • 01:02 bstorm: deployed new version of maintain-kubeusers with new count quotas for new tools T286784

2021-08-20

  • 19:10 majavah: rebuilding node12-sssd/{base,web} to use debian packaged npm 7
  • 18:42 majavah: rebuilding php74-sssd/{base,web} to use composer 2

2021-08-18

  • 21:32 bstorm: rebooted tools-sgecron-01 due to RAM filling up and killing everything
  • 16:34 bstorm: deleting the sssd cache on tools-sgecron-01 to fix a peculiar passwd db issue

2021-08-16

  • 17:00 majavah: remove and re-add toollabs-webservice 0.75 on stretch-toolsbeta repository
  • 15:45 majavah: reset sul account mapping on striker for developer account "DutchTom" T288969
  • 14:19 majavah: building node12 images - T284590 T243159

2021-08-15

  • 17:30 majavah: deploying updated jobs-framework-api container list to include bullseye images
  • 17:22 majavah: finished initial build of images: php74, jdk17, python39, ruby27 - T284590
  • 16:51 majavah: starting build of initial bullseye based images - T284590
  • 16:44 majavah: tagged and building toollabs-webservice 0.76 with bullseye images defined T284590
  • 15:14 majavah: building tools-webservice 0.74 (currently live version) to bullseye-tools and bullseye-toolsbeta

2021-08-12

  • 16:59 bstorm: deployed updated manifest for ingress-admission
  • 16:45 bstorm: restarted ingress admission pods in tools after testing in toolsbeta
  • 16:27 bstorm: updated the docker image for docker-registry.tools.wmflabs.org/ingress-admission:latest
  • 16:22 bstorm: rebooting tools-docker-registry-05 after exchanging uids for puppet and docker-registry

2021-08-07

  • 05:59 majavah: restart nginx on toolserver-proxy-01 to see if that helps with the flapping icinga certificate expiry check

2021-08-06

  • 16:17 bstorm: failed over to tools-docker-registry-06 (which has more space) T288229
  • 00:43 bstorm: set up sync between the new registry host and the existing one T288229
  • 00:21 bstorm: provisioning second docker registry server to rsync to (120GB disk and fairly large server) T288229

2021-08-05

  • 23:50 bstorm: rebooting the docker registry T288229
  • 23:04 bstorm: extended docker registry volume to 120GB T288229

2021-07-29

  • 18:04 majavah: reset sul account mapping on striker for developer account "Derek Zax" T287369

2021-07-28

  • 21:33 majavah: add mdipietro as projectadmin and to sudo policy T287287

2021-07-27

  • 16:20 bstorm: built new php images with python2 on board T287421
  • 00:04 bstorm: deploy a version of the php3.7 web image that includes the python2 package with tag :testing T287421

2021-07-26

  • 17:37 bstorm: repooled the whole set of ingress workers after upgrades T280340
  • 16:37 bstorm: removing tools-k8s-ingress-4 from active ingress nodes at the proxy T280340

2021-07-23

  • 07:15 majavah: restart nginx on tools-static-14 to see if it helps with fontcdn issues

2021-07-22

  • 23:35 bstorm: deleted tools-sgebastion-09 since it has been shut off since March anyway
  • 15:32 arturo: re-deploying toolforge-jobs-framework-api
  • 15:30 arturo: pushed new docker image on the registry for toolforge-jobs-framework-api 4d8235b (T287077)

2021-07-21

  • 20:01 bstorm: deployed new maintain-kubeusers to toolforge T285011
  • 19:55 bstorm: deployed new rbac for maintain-kubeusers changes T285011
  • 17:10 majavah: deploying calico v3.18.4 T280342
  • 14:35 majavah: updating systemd on toolforge stretch bastions T287036
  • 11:59 arturo: deploying jobs-framework-api 07346d7 (T286108)
  • 11:04 arturo: enabling TTLAfterFinished feature gate on kubeadm live configmap (T286108)
  • 11:01 arturo: enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-{apiserver,controller-manager}.yaml in all 3 control nodes (T286108)
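
For the 11:01 entry, the feature gate goes into the command arguments of each static pod manifest and the kubelet restarts the pod when the file changes. A minimal sketch of the added flag and a verification step, assuming stock kubeadm manifest paths (the exact edit is not recorded in the log):

    # Add to spec.containers[0].command in each manifest:
    #     - --feature-gates=TTLAfterFinished=true
    # Then confirm the flag is present and the pods came back up:
    sudo grep -- '--feature-gates' /etc/kubernetes/manifests/kube-apiserver.yaml \
        /etc/kubernetes/manifests/kube-controller-manager.yaml
    kubectl -n kube-system get pods | grep -E 'kube-(apiserver|controller-manager)'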

2021-07-20

  • 18:42 majavah: deploying systemd security tools on toolforge public stretch machines T287004
  • 17:45 arturo: pushed new toolforge-jobs-framework-api docker image into the registry (3a6ae38) (T286126)
  • 17:37 arturo: added toolforge-jobs-framework-cli v3 to aptly buster-tools and buster-toolsbeta
  • 13:25 majavah: apply buster systemd security updates

2021-07-19

  • 23:24 bstorm: applied matchPolicy: equivalent to tools ingress validation controller T280360
  • 16:43 bstorm: cleared queue error state caused by excessive resource use by topicmatcher T282474

2021-07-16

  • 14:04 arturo: deployed jobs-framework-api 42b7a88 (T286132)
  • 11:57 arturo: added toollabs-webservice_0.75_all to jessie-tools aptly repo (T286003)
  • 11:52 arturo: created `jessie-tools` aptly repository on tools-services-05 (T286003)

2021-07-15

2021-07-14

  • 23:29 bstorm: mounted nfs on tools-services-05 and backing up aptly to NFS dir T286003
  • 09:17 majavah: copying calico 3.18.4 images from docker hub to docker-registry.tools.wmflabs.org T280342

2021-07-12

  • 16:56 bstorm: deleted job 4720371 due to LDAP failure
  • 16:51 bstorm: cleared the E state from two job queues

2021-07-02

  • 18:46 bstorm: cleared error state for tools-sgeexec-0940.tools.eqiad.wmflabs

2021-07-01

  • 22:08 bstorm: releasing webservice 0.75
  • 17:03 andrewbogott: rebooting tools-k8s-worker-[31,33,35,44,49,51,57-58,70].tools.eqiad1.wikimedia.cloud
  • 16:47 bstorm: remounted scratch everywhere...but mostly tools T224747
  • 15:47 arturo: rebased labs/private.git
  • 11:04 arturo: added toolforge-jobs-framework-cli_1_all.deb to aptly buster-tools,buster-toolsbeta
  • 10:34 arturo: refreshed jobs-api deployment

2021-06-29

  • 21:58 bstorm: clearing one errored queue and a stack of discarded jobs
  • 20:11 majavah: toolforge kubernetes upgrade complete T280299
  • 17:03 majavah: starting toolforge kubernetes 1.18 upgrade - T280299
  • 16:17 arturo: deployed jobs-framework-api in the k8s cluster
  • 15:34 majavah: remove duplicate definitions from tools-clushmaster-02 /root/.ssh/known_hosts
  • 15:12 arturo: livehacking puppetmaster for T283238
  • 10:24 dcaro: running puppet on the buster bastions after 20000 minutes failing... might break something

2021-06-15

  • 19:02 bstorm: cleared error status from a few queues
  • 16:15 majavah: deleting unused shutdown nodes: tools-checker-03 tools-k8s-haproxy-1 tools-k8s-haproxy-2

2021-06-14

  • 22:21 bstorm: push docker-registry.tools.wmflabs.org/toolforge-python37-sssd-web:testing to test staged os.execv (and other patches) using toolsbeta toollabs-webservice version 0.75 T282975

2021-06-13

  • 08:15 majavah: clear grid error state from tools-sgeexec-0907, tools-sgeexec-0916, tools-sgeexec-0940

2021-06-12

  • 14:39 majavah: remove nonexistent tools-prometheus-04 and add tools-prometheus-05 to hiera key "prometheus_nodes"
  • 13:53 majavah: create empty bullseye-{tools,toolsbeta} repositories on tools-services-05 aptly

2021-06-10

  • 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939

2021-06-09

  • 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940

2021-06-07

2021-06-04

  • 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" T264221
  • 21:21 bstorm: cleared error state from 4 grid queues

2021-06-03

  • 18:27 majavah: renew prometheus kubernetes certificate T280301
  • 17:06 majavah: renew admission webhook certificates T280301

2021-06-01

  • 10:10 majavah: properly clean up deleted vms tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn the first time
  • 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950

2021-05-30

  • 18:58 majavah: clear grid error state from 14 queues

2021-05-27

  • 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
  • 16:04 bstorm: cleared error state from several exec node queues
  • 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15

2021-05-24

  • 10:36 arturo: rebased labs/private.git after merge conflict
  • 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires

2021-05-22

  • 14:47 majavah: manually remove jeh's admin certificates and remove jeh from the maintain-kubeusers configmap T282725
  • 14:32 majavah: manually remove valhallasw and yuvipanda admin certificates, remove them from the configmap, and restart the maintain-kubeusers pod T282725
  • 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors

2021-05-21

  • 17:06 majavah: unpool tools-k8s-ingress-[4-6]
  • 17:06 majavah: repool tools-k8s-ingress-6
  • 17:02 majavah: repool tools-k8s-ingress-4 and -5
  • 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
  • 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
  • 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
  • 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
  • 16:04 majavah: rollback kubernetes ingress update from front proxy
  • 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] T264221

2021-05-20

  • 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 T264221
  • 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node T264221
  • 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups T264221
  • 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group T264221

2021-05-19

  • 12:15 Majavah: rollback ingress-nginx-gen2
  • 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace T264221
  • 10:44 Majavah: create tools-k8s-ingress-[4-6] T264221
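
A minimal sketch of the helm-based install from the 11:09 entry, assuming the upstream ingress-nginx chart; the registry override pointing at the local mirror is shown only as an example, and the chart version actually used is not recorded:

    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install ingress-nginx-gen2 ingress-nginx/ingress-nginx \
        --namespace ingress-nginx-gen2 --create-namespace \
        --set controller.image.registry=docker-registry.tools.wmflabs.org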

2021-05-16

  • 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941

2021-05-14

  • 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend T218338
  • 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
  • 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01

2021-05-12

  • 19:45 bstorm: cleared error state from some queues
  • 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings T282725
  • 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast T282725
  • 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers T282725

2021-05-11

  • 17:17 Majavah: shutdown and delete tools-checker-03 T278540
  • 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
  • 17:12 Majavah: add tools-checker-04 as a grid submit host T278540
  • 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key T278540
  • 16:49 Majavah: creating tools-checker-04 with buster T278540
  • 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 T252239
  • 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 T252239

2021-05-10

  • 22:58 bstorm: cleared error state on a grid queue
  • 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
  • 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 (T252239)
  • 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
  • 15:03 Majavah: clear all error states caused by overloaded exec nodes
  • 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) (T252239)
  • 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived
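
For the 14:57 entry, letting the new haproxy instances answer for the keepalived VIP is normally done by adding the address to each instance port's allowed-address-pairs. A minimal sketch using the standard OpenStack CLI; the exact commands used are not in the log:

    VIP=172.16.6.113
    for host in tools-k8s-haproxy-3 tools-k8s-haproxy-4; do
        for port in $(sudo wmcs-openstack --os-project-id=tools port list --server "$host" -f value -c ID); do
            sudo wmcs-openstack --os-project-id=tools port set --allowed-address ip-address="$VIP" "$port"
        done
    done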

2021-05-09

  • 06:55 Majavah: clear error state from tools-sgeexec-0916

2021-05-08

  • 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 T264221

2021-05-07

  • 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
  • 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud T282227
  • 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready T282227
  • 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`

2021-05-06

  • 14:43 Majavah: clear error states from all currently erroring exec nodes
  • 14:37 Majavah: clear error state from tools-sgeexec-0913
  • 04:35 Majavah: add own root key to project hiera on horizon T278390
  • 02:36 andrewbogott: removing jhedden from sudo roots

2021-05-05

  • 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for T278390

2021-05-04

  • 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
  • 10:47 arturo: rebase & resolve merge conflicts in labs/private.git

2021-05-03

  • 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after (T280641)
  • 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration (T280641)

2021-04-29

  • 18:23 bstorm: removing one more etcd node via cookbook T279723
  • 18:12 bstorm: removing an etcd node via cookbook T279723

2021-04-27

  • 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
  • 16:16 bstorm: cleared E status on grid queues to get things flowing again

2021-04-26

  • 12:17 arturo: allowing more tools into the legacy redirector (T281003)

2021-04-22

  • 08:44 Krenair: Removed yuvipanda from roots sudo policy
  • 08:42 Krenair: Removed yuvipanda from projectadmin per request
  • 08:40 Krenair: Removed yuvipanda from tools.admin per request

2021-04-20

  • 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i | xargs sudo exim -Mt"`
  • 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
  • 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta T280300
  • 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. was -2 which is decommed.
  • 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) (T279990)

2021-04-19

  • 10:53 dcaro: reverting the change that set the prometheus data source in grafana to 'server'; it can't connect
  • 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues

2021-04-16

  • 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation T277653
  • 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk (T279990), we got <5days xd
  • 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts

2021-04-15

  • 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job

2021-04-13

  • 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
  • 11:23 arturo: deleted shutoff VM tools-package-builder-02 (T275864)
  • 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 (T278354)
  • 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 (T278303)
  • 11:18 arturo: deleted shutoff VM tools-mail-02 (T278538)
  • 11:17 arturo: deleted shutoff VMs tools-static-12,13 (T278539)

2021-04-11

  • 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936

2021-04-08

  • 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns T277653
  • 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` (T275865)
  • 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) (T275865)
  • 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 (T275865)
  • 09:13 arturo: created tools-sgebastion-11 (buster) (T275865)

2021-04-07

  • 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone

2021-04-06

  • 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
  • 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
  • 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
  • 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
  • 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)
  • 10:21 arturo: published jobutils & misctools 1.42 (T278748)
  • 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
  • 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
  • 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
  • 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)
  • 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
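
The member additions above were done with the wmcs.toolforge.add_etcd_node cookbook (and removals with its counterpart); underneath, they amount to etcd membership changes plus config refreshes. A minimal sketch of the etcd side only, assuming etcdctl v3 on a healthy member, with TLS flags omitted and an example member ID:

    export ETCDCTL_API=3
    etcdctl --endpoints=https://127.0.0.1:2379 member list
    etcdctl --endpoints=https://127.0.0.1:2379 member remove 8e9e05c52164694d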

2021-04-05

  • 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
  • 09:56 arturo: make jhernandez (IRC joakino) projectadmin (T278975)

2021-04-01

  • 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
  • 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member (T267082)
  • 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs (T267082)
  • 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud (T267082)
  • 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)

2021-03-31

  • 15:57 arturo: rebooting `tools-mail-03` after enabling NFS (T267082, T278538)
  • 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
  • 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
  • 14:56 arturo: shutoff tools-mail-02 (T278538)
  • 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 (T278538)
  • 14:45 arturo: created VM `tools-mail-03` as Debian Buster (T278538)
  • 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
  • 09:44 dcaro: running disk performance test on etcd-4 (round2)
  • 09:05 dcaro: running disk performance test on etcd-8
  • 08:43 dcaro: running disk performance test on etcd-4
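
A minimal sketch of the kind of disk performance test mentioned in the 08:43-09:44 entries, assuming the common fio-based etcd check (the actual test run is not recorded): small sequential writes with an fdatasync after each, matching etcd's WAL write pattern.

    sudo mkdir -p /var/lib/etcd/perf-test
    sudo fio --name=etcd-perf --directory=/var/lib/etcd/perf-test \
        --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
    sudo rm -rf /var/lib/etcd/perf-test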

2021-03-30

  • 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix T278539
  • 15:44 arturo: shutoff tools-static-12/13 (T278539)
  • 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14 (T278539)
  • 15:37 arturo: add `mount_nfs: true` to tools-static prefix (T278539)
  • 15:26 arturo: create VM tools-static-14 with Debian Buster image (T278539)
  • 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` (T278436)
  • 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
  • 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster (T275865)
  • 11:04 arturo: created server group `tools-bastion` with anti-affinity policy

2021-03-28

  • 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f 9999704 # T278645

2021-03-27

2021-03-26

  • 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster (T275864)

2021-03-25

  • 19:30 bstorm: forced deletion of all jobs stuck in a deleting state T277653
  • 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master (T277653)
  • 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster (T277653)
  • 16:18 arturo: icinga-downtime toolschecker for 2h
  • 16:05 bstorm: failed over the tools grid to the shadow master T277653
  • 13:36 arturo: shutdown tools-sge-services-03 (T278354)
  • 13:33 arturo: shutdown tools-sge-services-04 (T278354)
  • 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) (T278354)
  • 12:58 arturo: created VM `tools-services-05` as Debian Buster (T278354)
  • 12:51 arturo: create cinder volume `tools-aptly-data` (T278354)

2021-03-24

  • 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` (T278303)
  • 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly (T278303)
  • 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` (T278303)
  • 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` (T278303)
  • 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
  • 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster (T278303)
  • 12:09 arturo: detach cinder volume `tools-docker-registry-data` (T278303)
  • 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data (T278303)
  • 11:20 arturo: created 80G cinder volume tools-docker-registry-data (T278303)
  • 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining

2021-03-23

  • 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
  • 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653)
  • 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
  • 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy

2021-03-18

  • 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
  • 16:21 andrewbogott: enabling puppet tools-wide
  • 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
  • 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster T277756
  • 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
  • 03:59 bstorm: rebooting grid master. sorry for the cron spam
  • 03:49 bstorm: restarting sssd on tools-sgegrid-master
  • 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
  • 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
  • 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand
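
Several of the entries above (the 03:35-04:12 reboots and sssd restart) deal with instances that lost LDAP lookups. A minimal sketch of the usual non-reboot recovery, assuming sssd is the NSS backend as on Toolforge hosts and using an example account for the spot-check:

    sudo systemctl restart sssd
    sudo sss_cache -E                  # invalidate every cached entry
    getent passwd tools.admin          # verify LDAP lookups resolve again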

2021-03-17

  • 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
  • 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv

2021-03-16

  • 16:31 arturo: installing jobutils and misctools 1.41
  • 15:55 bstorm: deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999)
  • 12:32 arturo: add packages jobutils / misctools v1.41 to {stretch,buster}-tools aptly repository in tools-sge-services-03

2021-03-12

  • 23:13 bstorm: cleared error state for all grid queues

2021-03-11

  • 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
  • 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
  • 13:11 arturo: add misctools 1.37 to buster-tools|toolsbeta aptly repo for T275865
  • 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for T275865

2021-03-10

  • 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag

2021-03-09

  • 13:31 arturo: hard-reboot tools-docker-registry-04 because of issues related to T276922
  • 12:34 arturo: briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and it failed to migrate away

2021-03-05

  • 12:30 arturo: started tools-redis-1004 again
  • 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035

2021-03-04

  • 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
  • 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022

2021-03-03

  • 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
  • 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-372f6022f345 --active` and try again
  • 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn

2021-03-02

  • 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
  • 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those

2021-02-27

  • 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better T275910
  • 02:00 bstorm: running a script to repair the dumps mount in all podpresets T275371

2021-02-26

  • 22:04 bstorm: cleaned up grid jobs 1230666,1908277,1908299,2441500,2441513
  • 21:27 bstorm: hard rebooting tools-sgeexec-0947
  • 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
  • 20:01 bd808: Deleted csr in strange state for tool-ores-inspect

2021-02-24

  • 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` T267313
  • 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state

2021-02-23

  • 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes T272397
  • 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes T272397

2021-02-22

  • 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
  • 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack T275411
  • 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) T275411
  • 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) T275411
  • 19:03 bstorm: depooled tools-sgeexec-0918 T275411
  • 18:56 bstorm: deleted job 1962508 from the grid to clear it up T275301
  • 16:58 bstorm: cleared error state on several grid queues
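
A minimal sketch of the OpenStack-side operations in the 19:05-19:09 entries, assuming project credentials on a cloud control host; the wrappers actually used may differ:

    sudo wmcs-openstack --os-project-id=tools server stop tools-sgeexec-0918
    sudo wmcs-openstack --os-project-id=tools server reboot --hard tools-sgeexec-0918
    sudo wmcs-openstack --os-project-id=tools server show tools-sgeexec-0918 -f value -c status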

2021-02-19

  • 12:31 arturo: deploying new version of toolforge ingress admission controller

2021-02-17

  • 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)

2021-02-04

  • 16:27 bstorm: rebooting tools-package-builder-02

2021-01-26

  • 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for T272978

2021-01-22

  • 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 (T272679)

2021-01-21

  • 23:58 bstorm: deployed new maintain-kubeusers to tools T271847

2021-01-19

  • 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err T272247
  • 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log T272247
  • 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' T272247
  • 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' T272247
  • 16:37 bd808: Added Jhernandez to root sudoers group
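
The 22:34-22:57 entries free NFS space taken by runaway error logs. A minimal sketch, assuming in-place truncation so the writing process keeps a valid file handle (path taken from the 22:57 entry):

    sudo du -h /data/project/robokobot/virgule.err
    sudo truncate -s 0 /data/project/robokobot/virgule.err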

2021-01-14

  • 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
  • 20:43 bstorm: running tc-setup across the k8s workers
  • 20:40 bstorm: running tc-setup across the grid fleet
  • 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein T261134

2021-01-13

  • 10:02 arturo: delete floating IP allocation 185.15.56.245 (T271867)

2021-01-12

  • 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again T271842

2021-01-05

  • 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them T267966

2021-01-04

  • 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.

2020-12-22

  • 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
  • 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git

2020-12-18

  • 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 T267966

2020-12-17

2020-12-11

  • 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
  • 12:14 dcaro: upgrading stable/main (clinic duty)
  • 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
  • 12:03 dcaro: upgrading stable-updates/main, mainly cacertificates (clinic duty)
  • 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
  • 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
  • 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reason (T263284)
  • 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
  • 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
  • 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
  • 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
  • 10:58 dcaro: upgrade kubectl done (clinic duty)
  • 10:53 dcaro: upgrade kubectl (clinic duty)
  • 10:16 dcaro: upgrading oldstable/main packages (clinic duty)

2020-12-10

  • 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 T263284
  • 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes (T263284)
  • 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
  • 15:41 arturo: icinga-downtime toolschecker for 2h (T263284)
  • 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
  • 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
  • 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix (T263284)
  • 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade (T263284)
  • 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
  • 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)

2020-12-08

  • 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well T269016

2020-12-07

  • 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016

2020-12-03

  • 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
  • 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'

2020-11-28

  • 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
  • 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for T268904, seems to have regenerated ~tools.mdbot/.kube/config

2020-11-24

  • 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
  • 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
  • 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet

2020-11-10

  • 19:45 andrewbogott: rebooting tools-sgeexec-0950; OOM

2020-11-02

  • 13:35 arturo: (typo: dcaro)
  • 13:35 arturo: added dcar as projectadmin & user (T266068)

2020-10-29

  • 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image (T265681)
  • 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem T266506
  • 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
  • 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image (T265686)

2020-10-28

  • 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings T266506
  • 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node T266506
  • 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix T266506

2020-10-23

  • 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools (T266270)

2020-10-21

  • 17:58 legoktm: pushed toolforge-buster0-{build,run}:latest images to docker registry

2020-10-15

  • 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
  • 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
  • 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
  • 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
  • 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45

2020-10-14

  • 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
  • 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
  • 20:31 bd808: Deployed toollabs-webservice v0.74
  • 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
  • 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
  • 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
  • 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph

2020-10-10

  • 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again

2020-10-08

  • 17:07 bstorm: rebuilding docker images with locales-all T263339

2020-10-06

  • 19:04 andrewbogott: uncordoned tools-k8s-worker-38
  • 18:51 andrewbogott: uncordoned tools-k8s-worker-52
  • 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration

2020-10-02

  • 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
  • 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
  • 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk

2020-10-01

  • 21:39 andrewbogott: migrating tools-proxy-06 to ceph
  • 21:35 andrewbogott: moving k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow

2020-09-30

  • 18:34 andrewbogott: repooling tools-sgeexec-0918
  • 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036

2020-09-23

  • 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install

2020-09-18

  • 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
  • 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
  • 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
  • 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
  • 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
  • 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916 for flavor update
  • 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 after flavor update
  • 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912 for flavor update
  • 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 after flavor update
  • 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920 for flavor update
  • 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
  • 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update

2020-09-17

  • 21:56 bd808: Built and deployed tools-manifest v0.22 (T263190)
  • 21:55 bd808: Built and deployed tools-manifest v0.22 (T169695)
  • 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 (T263190)
  • 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
  • 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
  • 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
  • 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
  • 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
  • 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
  • 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
  • 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
  • 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
  • 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
  • 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph

2020-09-16

  • 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
  • 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
  • 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master

2020-09-10

  • 15:37 arturo: hard-rebooting tools-proxy-05
  • 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
  • 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
  • 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster (T250172)

2020-09-09

2020-09-08

  • 23:24 bstorm: clearing grid queue error states blocking job runs
  • 22:53 bd808: forcing puppet run on tools-sgebastion-07

2020-09-02

  • 18:13 andrewbogott: moving tools-sgeexec-0920 to ceph
  • 17:57 andrewbogott: moving tools-sgeexec-0942 to ceph

2020-08-31

  • 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
  • 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
  • 17:19 andrewbogott: repooled tools-sgeexec-0901
  • 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log T261677
  • 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there T261677
  • 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph

2020-08-30

  • 00:57 Krenair: also ran qconf -ds on each
  • 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
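
The 00:35 entry compresses the whole cleanup into one line. Spelled out as a per-node sketch, assuming gridengine admin rights on the master; the job ID is a placeholder:

    HOST=tools-sgeexec-0921.tools.eqiad.wmflabs
    sudo qconf -dattr hostgroup hostlist "$HOST" @general    # drop the node from @general
    sudo qmod -rj 1234567                                    # reschedule a job still registered on it
    sudo qdel -f 1234567                                     # force-delete it if rescheduling fails
    sudo qconf -de "$HOST"                                   # delete the exec host entry
    sudo qconf -ds "$HOST"                                   # delete the submit host entry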

2020-08-29

  • 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
  • 16:00 bstorm: deleting "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"

2020-08-26

2020-08-25

  • 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
  • 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
  • 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)

2020-08-19

  • 21:29 andrewbogott: shutting down and removing tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
  • 21:15 andrewbogott: shutting down and removing tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
  • 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79

2020-08-18

  • 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages

2020-07-30

  • 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66. T258663

2020-07-29

  • 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away

2020-07-24

  • 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
  • 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
  • 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
  • 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org

2020-07-22

  • 23:24 bstorm: created server group 'tools-k8s-worker' for new worker nodes, so that openstack gives them a low chance of being scheduled together unless necessary T258663
  • 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] T257945
  • 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] T257945
  • 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] T257945
  • 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] T257945
  • 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once T257945
  • 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 T257945
  • 22:15 bstorm: set the tools-k8s-control nodes to also use 800MBps to prevent issues with toolforge ingress and api system
  • 22:07 bstorm: set the tools-k8s-haproxy-1 (main load balancer for toolforge) to have an egress limit of 800MB per sec instead of the same as all the other servers

2020-07-21

  • 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
  • 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'

2020-07-17

  • 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test (T102367)
  • 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually (T102367)

2020-07-15

  • 23:11 bd808: Removed ssh root key for valhallasw from project hiera (T255697)

2020-07-09

  • 18:53 bd808: Updating git-review to 1.27 via clush across cluster (T257496)

2020-07-08

2020-07-07

  • 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 (T234617, T257229)
  • 23:19 bd808: Deploying webservice v0.73 via clush (T234617, T257229)
  • 23:16 bd808: Building webservice v0.73 (T234617, T257229)
  • 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
  • 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
  • 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) (T247236)

2020-07-06

  • 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 (T247236)
  • 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector (T247236)

2020-07-01

2020-06-30

  • 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` (T256737)

2020-06-29

2020-06-25

  • 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 T256426
  • 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings T256426
  • 21:24 bstorm: hard rebooting tools-sgebastion-09

2020-06-24

2020-06-23

  • 17:55 arturo: killed procs for users `hamishz` and `msyn` which apparently were tools that should be running in the grid / kubernetes instead
  • 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera

2020-06-17

  • 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix (T247236, T234617)

2020-06-16

  • 23:01 bd808: Building new Docker images to pick up webservice 0.72
  • 22:58 bd808: Deploying webservice 0.72 to bastions and grid
  • 22:56 bd808: Building webservice 0.72
  • 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898

2020-06-15

  • 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions T157792
  • 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 (T254640, T253412)
  • 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
  • 18:05 bd808: Building webservice 0.71

2020-06-12

  • 13:13 arturo: live-hacking session in the puppetmaster ended
  • 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 for testing a PAWS-related patch (they share haproxy puppet code)
  • 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01

2020-06-11

  • 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough

2020-06-04

  • 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*

2020-06-02

2020-06-01

  • 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster T250874
  • 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
  • 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition

2020-05-29

  • 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty T252217

2020-05-28

  • 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
  • 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 T246122
  • 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now T246122
  • 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions T246122
  • 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 (T246122)
  • 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 (T246122)
  • 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 (T246122)
  • 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 (T246122)
  • 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
  • 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 (T253816)
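
A minimal sketch of the per-node flow behind the 15:09-16:01 upgrade entries, assuming kubeadm-managed nodes with the pinned component repo already providing 1.16.10 packages:

    # On the first control node:
    sudo apt-get update && sudo apt-get install -y kubeadm
    sudo kubeadm upgrade plan
    sudo kubeadm upgrade apply v1.16.10
    # On each remaining node, after draining it from a control node:
    sudo apt-get install -y kubelet kubectl
    sudo systemctl restart kubelet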

2020-05-27

  • 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"

2020-05-26

  • 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta T246059 T211096
  • 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap T246122

2020-05-22

  • 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems with 10 min warning
  • 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 T253412

2020-05-21

  • 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 (T252700)
  • 22:36 bd808: Updated tools-webservice to 0.70 across instances (T252700)
  • 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py

2020-05-20

  • 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid (T247422)
  • 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'apt-get install tesseract-ocr -t stretch-backports -y'` (T247422)
  • 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` (T247422)
  • 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:tools name:tools-sge[bcew].*}' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` (T247422)

2020-05-19

  • 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such

2020-05-13

  • 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863
  • 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863

2020-05-09

  • 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260

2020-05-08

  • 18:17 bd808: Building all jessie-sssd derived images (T197930)
  • 17:29 bd808: Building new jessie-sssd base image (T197930)

2020-05-07

  • 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
  • 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
  • 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos

2020-05-06

  • 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] (T248702)
  • 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet (T248702)
  • 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances (T248702)
  • 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm (T248702)
  • 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
  • 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool

2020-05-05

  • 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
  • 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
  • 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
  • 21:51 bd808: Building 5 new k8s worker nodes (T248702)
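
A minimal sketch of joining a freshly built worker as in the entries above, assuming kubeadm; the API endpoint, port, token and hash shown here are placeholders printed by the first command:

    # On a control node: print a join command with a fresh bootstrap token.
    sudo kubeadm token create --print-join-command
    # On the new worker: run the printed command, roughly of this shape.
    sudo kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 \
        --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:0000000000000000000000000000000000000000000000000000000000000000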

2020-05-04

  • 22:08 bstorm_: deleting tools-elastic-01/2/3 T236606
  • 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files (T250866)
  • 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file (T250866)

2020-04-29

  • 22:13 bstorm_: running a fixup script after fixing a bug T247455
  • 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools T247455
  • 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image T247455
  • 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge T247455

2020-04-28

  • 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta T247455

2020-04-23

  • 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.

2020-04-21

  • 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869
  • 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host T250869

2020-04-20

  • 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 (T250625)
  • 14:47 arturo: added joakino to tools.admin LDAP group
  • 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie T236606
  • 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers (T250625)
  • 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
  • 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files
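
A minimal sketch of the aptly upload from the 12:46 entry, assuming the repo and publish names match (as the 2021-04-06 entries suggest) and an example .deb filename:

    aptly repo add stretch-tools toollabs-webservice_0.68_all.deb
    aptly publish update --skip-signing stretch-tools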

2020-04-15

  • 23:20 bd808: Building ruby25-sssd/base and children (T141388, T250118)
  • 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

2020-04-14

  • 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers T246123
  • 18:19 bstorm_: updating the maintain-kubeusers:latest image T246123
  • 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 T246123

2020-04-10

2020-04-09

  • 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
  • 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 (T219070)
  • 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] (T154504, T234617)
  • 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
  • 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"

2020-04-08

  • 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 (T154504, T234617)
  • 23:35 bstorm_: deploy toollabs-webservice v0.66 T154504 T234617

2020-04-07

  • 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and tools-sgebastion-09
  • 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07

2020-04-06

  • 19:16 bstorm_: deleted tools-redis-1001/2 T248929

2020-04-03

  • 22:40 bstorm_: shut down tools-redis-1001/2 T248929
  • 22:32 bstorm_: switch tools-redis-1003 to the active redis server T248929
  • 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group T248929
  • 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster T248929
  • 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster T248929
  • 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens

2020-03-30

  • 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for T248702
  • 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs T248702
  • 16:56 arturo: dropping `_psl.toolforge.org` TXT record (T168677)

2020-03-27

  • 21:22 bstorm_: removed puppet prefix tools-docker-builder T248703
  • 21:15 bstorm_: deleted tools-docker-builder-06 T248703
  • 18:55 bstorm_: launching tools-docker-imagebuilder-01 T248703
  • 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some test interactions with the API from Python

2020-03-24

2020-03-18

  • 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
  • 18:04 bstorm_: removed puppet prefix tools-flannel-etcd T246689
  • 17:58 bstorm_: removed puppet prefix tools-worker T246689
  • 17:57 bstorm_: removed puppet prefix tools-k8s-master T246689
  • 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster T246689
  • 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" T246689

2020-03-17

  • 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 (T219070)
  • 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 T246689

2020-03-16

  • 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 T246689
  • 22:00 bstorm_: shut off tools-k8s-master-01 T246689
  • 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 T246689

2020-03-11

  • 17:00 jeh: clean up apt cache on tools-sgebastion-07

2020-03-06

  • 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names

2020-03-03

  • 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) T236606
  • 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud T236606
  • 17:31 jeh: create a OpenStack virtual ip address for the new elasticsearch cluster T236606
  • 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) (T246689)
  • 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 (T246689)

2020-03-02

  • 22:26 jeh: starting first pass of elasticsearch data migration to new cluster T236606

2020-03-01

2020-02-28

  • 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
  • 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
  • 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
  • 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
  • 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
  • 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
  • 15:28 jeh: create OpenStack server group tools-elastic with anti-affinity policy enabled T236606
  • 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] T236606
  • 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
  • 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster (the kubeadm join flow is sketched after this list)
  • 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
  • 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
  • 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
  • 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
  • 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
  • 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
  • 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
  • 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
  • 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
  • 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
  • 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
  • 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
  • 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
  • 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
  • 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
  • 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
  • 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
  • 00:50 bstorm_: rebuilt all docker images to include webservice 0.64
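The long run of "Joined tools-k8s-worker-NN" entries above is the standard kubeadm join flow; a minimal sketch, assuming a fresh bootstrap token is minted on a control-plane node first (the API endpoint is the cluster FQDN used elsewhere in this log, the port is the kubeadm default, and the token/hash values are placeholders):

    # On a control-plane node: print a join command with a new bootstrap token.
    kubeadm token create --print-join-command
    # On the new worker, run the printed command (values shown are placeholders):
    kubeadm join k8s.tools.eqiad1.wikimedia.cloud:6443 \
        --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:<hash>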

2020-02-27

  • 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
  • 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
  • 21:03 jeh: add reindex service account to elasticsearch for data migration T236606
  • 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
  • 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
  • 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
  • 18:20 bd808: Building tools-k8s-worker-[36-55]
  • 17:56 bd808: Deleted instances tools-worker-10[21-40]
  • 16:14 bd808: Decommissioning tools-worker-10[21-40]
  • 16:02 bd808: Drained tools-worker-1021
  • 15:51 bd808: Drained tools-worker-1022
  • 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
  • 15:39 bd808: Drained tools-worker-1025
  • 15:39 bd808: Drained tools-worker-1026
  • 15:11 bd808: Drained tools-worker-1027
  • 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
  • 15:07 bd808: Drained tools-worker-1030
  • 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was over optimistic about repacking legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
  • 15:00 bd808: Drained tools-worker-1031
  • 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
  • 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 14:41 bd808: Drained tools-worker-1032
  • 14:37 bd808: Drained tools-worker-1033
  • 14:35 bd808: Drained tools-worker-1034
  • 14:34 bd808: Drained tools-worker-1035
  • 14:33 bd808: Drained tools-worker-1036
  • 14:33 bd808: Drained tools-worker-10{39,38,37} yesterday but did not !log
  • 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
  • 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
  • 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
  • 00:02 bd808: Rebooting tools-worker-1002
  • 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems

2020-02-26

  • 23:42 bd808: Drained tools-worker-1040
  • 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
  • 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
  • 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
  • 21:06 bstorm_: deleting loads of stuck grid jobs
  • 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
  • 20:15 bstorm_: rebooting tools-sgegrid-master because it actually had the permissions thing going on still
  • 18:03 bstorm_: downtimed toolschecker for nfs maintenance

2020-02-25

  • 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`

2020-02-23

2020-02-21

  • 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022

2020-02-20

  • 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
  • 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week T245365

2020-02-19

  • 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 T245365
  • 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid T245365
  • 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
  • 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)

2020-02-18

  • 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
  • 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it

2020-02-17

2020-02-14

  • 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster (T244791)
  • 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster (T244791)
  • 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster (T244791)
  • 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster (T244791)
  • 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster (T244791)
  • 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster (T244791)
  • 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster (T244791)
  • 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster (T244791)
  • 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster (T244791)
  • 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster (T244791)

2020-02-13

  • 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster (T244791)
  • 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster (T244791)
  • 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster (T244791)
  • 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} (T244791)
  • 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092{1,2,3,4,5,6,7,8} & tools-sgewebgrid-generic-090{3,4} from grid engine config (T244791)
  • 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022

2020-02-12

  • 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) (T244954)
  • 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions (T244954)
  • 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 (T244791)
  • 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 (T244791)
  • 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 (T244791)
  • 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 (T244791)
  • 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 (T244791)

2020-02-11

  • 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 (T244791)
  • 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 (T244791)
  • 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 (T244791)

2020-02-10

  • 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
  • 22:51 bstorm_: all docker images now use webservice 0.62
  • 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 (T244791)
  • 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 T244293 T244289 T234617 T156626

2020-02-07

  • 10:55 arturo: drop jessie VM instances tools-prometheus-{01,02} which were shutdown (T238096)

2020-02-06

2020-02-05

  • 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) (T238096)

2020-02-04

  • 11:38 arturo: start tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs (T238096)
  • 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) T238096

2020-02-03

  • 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
  • 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03, data synced (T238096)
  • 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-{03,04} (T238096)

2020-01-31

  • 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working (T238096)
  • 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0{3,4} due to some inconsistencies preventing prometheus from starting (T238096)

2020-01-30

  • 21:04 andrewbogott: also apt-get install python3-novaclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 20:39 andrewbogott: apt-get install python3-keystoneclient on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam. Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
  • 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 (T238096)
  • 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 (T238096)
  • 13:42 arturo: disable puppet in prometheus servers while syncing metric data (T238096)
  • 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because that is not how the prometheus setup currently works; use a web proxy instead: `tools-prometheus-new.wmflabs.org` (T238096)
  • 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test T238096
  • 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 (T238096)
  • 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with Designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup (T238096)
  • 10:20 arturo: create new VM instance tools-prometheus-03 (T238096)

2020-01-29

  • 20:07 bd808: Created {bastion,login,dev}.toolforge.org service names for Toolforge bastions using Horizon & Designate

2020-01-28

  • 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux | grep [t]ools.j | awk -F" " "{print \$2}") ; do echo "killing $i" ; sudo kill $i ; done || true'` (T243831)

2020-01-27

  • 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. T115231
  • 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. T115231

2020-01-24

  • 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
  • 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
  • 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
  • 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes

2020-01-23

  • 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
  • 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
  • 05:15 bd808: Building tools-elastic-04
  • 04:39 bd808: wmcs-openstack quota set --instances 192
  • 04:36 bd808: wmcs-openstack quota set --cores 768 --ram 1536000

2020-01-22

  • 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
  • 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)

2020-01-21

  • 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
  • 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
  • 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs
  • 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle

2020-01-16

  • 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple of running containers that don't want to die cleanly
  • 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
  • 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
  • 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` T242397

2020-01-14

  • 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
  • 02:23 andrewbogott: rebooting tools-paws-worker-1006 to resolve hangs associated with an old NFS failure

2020-01-13

  • 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 (T242642)
  • 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. T242559
  • 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. T242559
  • 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. T242559
  • 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. T242559

2020-01-12

  • 22:31 Krenair: same on -13 and -14
  • 22:28 Krenair: same on -8
  • 22:18 Krenair: same on -7
  • 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created

2020-01-11

  • 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.

2020-01-10

  • 23:31 bstorm_: updated toollabs-webservice package to 0.56
  • 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
  • 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
  • 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so

2020-01-09

  • 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
  • 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs T242353
  • 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs T242353
  • 18:06 bstorm_: rebooting tools-paws-master-01 T242353
  • 17:46 bstorm_: refreshing the paws cluster's entire x509 environment T242353

2020-01-07

  • 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
  • 16:33 arturo: deleted pod metrics/cadvisor-5pd46 by hand due to prometheus having issues scraping it
  • 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
  • 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster T242067
  • 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` (T241853)
  • 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 (T241853)
  • 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 (T241853)
  • 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace (T241853)
  • 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
  • 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
  • 05:02 bd808: Creating tools-k8s-worker-[6-14]
  • 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
  • 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
  • 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
  • 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread

2020-01-06

  • 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09{0[1-9],10}
  • 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
  • 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
  • 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
  • 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
  • 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
  • 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the NFS volumes on tools-k8s-haproxy-1 T241908
  • 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 T241908
  • 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix T241908
  • 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
  • 16:42 bstorm_: failed sge-shadow-master back to the main grid master
  • 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master

2020-01-04

  • 18:11 bd808: Shutdown tools-worker-1029
  • 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
  • 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
  • 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors T241884
  • 16:16 bd808: Draining tools-worker-10{05,12,28} due to hardware errors (T241884)
  • 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241884)
  • 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors (T241873)

2020-01-03

  • 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
  • 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 (T237643)
  • 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for T237643
  • 03:04 bd808: Really rebuilding all {jessie,stretch,buster}-sssd images. Last time I forgot to actually update the git clone.
  • 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox

2020-01-02

  • 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox

2019-12-30

  • 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for T241523
  • 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full

2019-12-29

  • 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - T241523

2019-12-27

  • 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07

2019-12-25

  • 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07

2019-12-22

  • 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test (T241310)
  • 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change (T241310)

2019-12-20

  • 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
  • 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues

2019-12-18

  • 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
  • 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.

2019-12-17

  • 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
  • 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster T234037
  • 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster T214513 T228499
  • 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
  • 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster T214513
  • 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster T214513 (more successfully this time)
  • 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs T214513
  • 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit T214513
  • 00:45 bstorm_: enabled encryption at rest on the new k8s cluster

2019-12-16

  • 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02
  • 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster

2019-12-14

  • 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).

2019-12-13

  • 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
  • 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
  • 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
  • 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
  • 00:45 bstorm_: rebooting tools-static-13
  • 00:28 bstorm_: rebooting the k8s master to clear NFS errors
  • 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream

2019-12-12

  • 23:36 bstorm_: rebooting toolschecker after downtiming the services
  • 22:58 bstorm_: rebooting tools-acme-chief-01
  • 22:53 bstorm_: rebooting the cron server, tools-sgecron-01, as it hadn't recovered from last night's maintenance
  • 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
  • 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
  • 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
  • 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues

2019-12-11

  • 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
  • 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031

2019-12-10

  • 13:59 arturo: set pod replicas to 3 in the new k8s cluster (T239405)
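A minimal sketch of what "set pod replicas to 3" typically amounts to, assuming the change was made with kubectl against a deployment (the namespace and deployment names below are placeholders, not taken from the log):

    # Scale a deployment to three replicas; names are illustrative only.
    kubectl -n <namespace> scale deployment <deployment-name> --replicas=3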

2019-12-09

  • 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test T91619

2019-12-04

  • 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use

2019-11-29

  • 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` (T239403)
  • 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
  • 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)

2019-11-26

  • 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones T236202
  • 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds T236202
  • 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
  • 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
  • 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
  • 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config

2019-11-25

  • 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes (T238655)
  • 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes (T238655)

2019-11-22

  • 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it (T238654)
  • 05:55 jeh: add Riley Huntley `riley` to base tools project

2019-11-21

  • 12:48 arturo: reboot the new k8s cluster after the upgrade
  • 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 (T238654)
  • 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 (T238654)
  • 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm (T238654)
  • 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster (T238654)
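A minimal sketch of a kubeadm-driven point upgrade like the one logged above, assuming the 1.15.6 Debian packages are already available to the nodes (the package pins follow the upstream naming convention and are an assumption; Toolforge's own repo layout is not shown):

    # On each node, install the target kubeadm first.
    apt-get install -y kubeadm=1.15.6-00
    # On the first control-plane node: review, then apply the upgrade.
    kubeadm upgrade plan
    kubeadm upgrade apply v1.15.6
    # Remaining nodes pull in the new config with that release's
    # "kubeadm upgrade node ..." subcommand, then kubelet/kubectl are updated.
    apt-get install -y kubelet=1.15.6-00 kubectl=1.15.6-00
    systemctl restart kubelet   # the 2019-11-21 entries show a full reboot was used instead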

2019-11-19

  • 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh (T237643)
  • 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster (T237643)

2019-11-15

  • 14:44 arturo: stop live-hacks on tools-prometheus-01 T237643

2019-11-13

  • 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster (T237643)

2019-11-12

  • 12:52 arturo: reboot tools-proxy-06 to reset iptables setup T238058

2019-11-10

2019-11-08

  • 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
  • 18:40 bstorm_: pushed new webservice package to the bastions T230961
  • 18:37 bstorm_: pushed new webservice package supporting buster containers to repo T230961
  • 18:36 bstorm_: pushed buster-sssd images to the docker repo
  • 17:15 phamhi: pushed new buster images with the prefix name "toolforge"

2019-11-07

  • 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster (T236826)
  • 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` T236826
  • 12:57 arturo: increasing project quota T237633
  • 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 T236826
  • 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` T236826
  • 11:43 arturo: create puppet prefix `tools-k8s-haproxy` T236826

2019-11-06

  • 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
  • 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed T215531
  • 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
  • 16:10 arturo: new k8s cluster control nodes are bootstrapped (T236826)
  • 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap (T236826)
  • 13:50 arturo: created 3 VMs`tools-k8s-control-[1,2,3]` (T236826)
  • 13:43 arturo: created `tools-k8s-control` puppet prefix T236826
  • 11:57 phamhi: restarted all webservices in grid (T233347)

2019-11-05

  • 23:08 Krenair: Dropped 59a77a3, 3830802, and 83df61f from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required T206235
  • 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. T236952
  • 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch T237468
  • 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
  • 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 (T233347)
  • 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] (T233347)
  • 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] (T233347)
  • 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] (T233347)
  • 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` T236826

2019-11-04

  • 14:45 phamhi: Built and pushed ruby25 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed golang111 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed jdk11 docker image based on buster (T230961)
  • 14:45 phamhi: Built and pushed php73 docker image based on buster (T230961)
  • 11:10 phamhi: Built and pushed python37 docker image based on buster (T230961)

2019-11-01

  • 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
  • 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
  • 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy T236952

2019-10-31

  • 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001. Runaway logfiles filled up the drive which prevented puppet from running. If puppet had run, it would have prevented the runaway logfiles.
  • 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` T236826
  • 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
  • 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently (T236962)
  • 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master (T236962)

2019-10-30

2019-10-29

  • 10:49 arturo: deleting VMs tools-test-proxy-01, no longer in use
  • 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 T235627

2019-10-28

  • 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
  • 15:54 arturo: tools-proxy-05 now has the 185.15.56.11 floating IP as the active proxy. The old one, 185.15.56.6, has been freed T235627
  • 15:54 arturo: shutting down tools-proxy-03 T235627
  • 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
  • 15:16 arturo: tools-proxy-05 now has the 185.15.56.5 floating IP as the active proxy T235627
  • 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy T235627
  • 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
  • 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
  • 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 (T235627)
  • 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix (T235627)
  • 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile (T235627)
  • 14:34 arturo: icinga downtime toolschecker for 1h (T235627)
  • 12:25 arturo: upload image `coredns` v1.3.1 (eb516548c180) to docker registry (T236249)
  • 12:23 arturo: upload image `kube-apiserver` v1.15.1 (68c3eb07bfc3) to docker registry (T236249)
  • 12:22 arturo: upload image `kube-controller-manager` v1.15.1 (d75082f1d121) to docker registry (T236249)
  • 12:20 arturo: upload image `kube-proxy` v1.15.1 (89a062da739d) to docker registry (T236249)
  • 12:19 arturo: upload image `kube-scheduler` v1.15.1 (b0b3c4c404da) to docker registry (T236249)
  • 12:04 arturo: upload image `calico/node` v3.8.0 (cd3efa20ff37) to docker registry (T236249)
  • 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 (f68c8f870a03) to docker registry (T236249)
  • 12:01 arturo: upload image `calico/cni` v3.8.0 (539ca36a4c13) to docker registry (T236249)
  • 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 (df5ff96cd966) to docker registry (T236249)
  • 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 (0439eb3e11f1) to docker registry (T236249)
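The image uploads above are the usual pull/retag/push pattern for mirroring an upstream image into the local registry; a minimal sketch using one image from the list, run on a host with push access to docker-registry.tools.wmflabs.org (the upstream source registry is an assumption based on the k8s.gcr.io copies noted elsewhere in this log):

    # Mirror an upstream image into the Toolforge registry.
    docker pull k8s.gcr.io/kube-apiserver:v1.15.1
    docker tag  k8s.gcr.io/kube-apiserver:v1.15.1 \
                docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1
    docker push docker-registry.tools.wmflabs.org/kube-apiserver:v1.15.1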

2019-10-24

  • 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge

2019-10-23

  • 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 (T233347)
  • 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools (T233347)
  • 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because the hypervisor is rebooting
  • 09:03 arturo: tools-sgebastion-08 is down because the hypervisor is rebooting

2019-10-22

  • 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
  • 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone
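A minimal sketch of creating a DNS zone (and a record inside it) with the OpenStack/Designate CLI, as in the entry above; the email address, record value, and record name are placeholders, and the exact flag spelling can vary between client versions:

    # Create the zone, then add an A record in it (values illustrative).
    openstack zone create --email admin@example.org tools.eqiad1.wikimedia.cloud.
    openstack recordset create --type A --record 172.16.0.10 \
        tools.eqiad1.wikimedia.cloud. k8s.tools.eqiad1.wikimedia.cloud.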

2019-10-21

  • 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46

2019-10-18

  • 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
  • 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
  • 21:29 bd808: Rescheduled all grid engine webservice jobs (T217815)
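For the grid engine cleanup above, a minimal sketch of the usual admin commands, run on the grid master (queue@host names taken from the log; per the SGE qmod man page, -c clears an error state and -r reschedules the jobs in the given queue instance):

    # Clear the error state of a web grid queue instance, then reschedule
    # the webservice jobs running in it.
    qmod -c 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0912'
    qmod -r 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0912'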

2019-10-16

  • 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools (T218461)
  • 09:29 arturo: Toolforge has recovered from the reboot of cloudvirt1029
  • 09:17 arturo: due to the reboot of cloudvirt1029, several nodes are offline: 8 sgeexec, 8 sgewebgrid-lighttpd, 3 tools-worker, plus the main Toolforge proxy (tools-proxy-03)

2019-10-15

  • 17:10 phamhi: restart tools-worker-1035 because it is no longer responding

2019-10-14

  • 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes (T229261)

2019-10-11

  • 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
  • 11:55 arturo: create tools-test-proxy-01 VM for testing T235059 and a puppet prefix for it
  • 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for T235059
  • 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for T235059

2019-10-10

  • 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.

2019-10-09

  • 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
  • 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
  • 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
  • 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
  • 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
  • 12:33 arturo: drain tools-worker-1010 to rebalance load
  • 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
  • 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
  • 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
  • 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting

2019-10-08

  • 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
  • 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
  • 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
  • 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
  • 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.

2019-10-07

  • 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
  • 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
  • 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
  • 19:25 bstorm_: deleted tools-puppetmaster-02
  • 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
  • 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
  • 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
  • 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
  • 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
  • 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
  • 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
  • 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
  • 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
  • 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
  • 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
  • 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
  • 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
  • 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
  • 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
  • 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
  • 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
  • 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
  • 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
  • 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
  • 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
  • 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
  • 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
  • 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
  • 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
  • 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
  • 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
  • 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
  • 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
  • 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
  • 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
  • 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
  • 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
  • 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
  • 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
  • 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
  • 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
  • 16:41 bstorm_: reboot tools-sgebastion-07
  • 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08

2019-10-04

  • 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
  • 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
  • 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
  • 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
  • 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated

2019-10-03

  • 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required

2019-09-27

  • 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
  • 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927

2019-09-25

  • 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021

2019-09-23

  • 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
  • 06:01 bd808: Restarted maintain-dbusers process on labstore1004. (T233530)

2019-09-12

  • 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use

2019-09-11

  • 13:30 jeh: restart tools-sgeexec-0912

2019-09-09

  • 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038

2019-09-06

  • 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 (T194859)

2019-09-05

  • 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run (T232135)
  • 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)

2019-09-01

  • 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01

2019-08-30

  • 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
  • 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts

2019-08-29

  • 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
  • 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
  • 22:05 bd808: Jessie Docker image rebuild complete
  • 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use

2019-08-27

  • 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again

2019-08-26

  • 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905

2019-08-18

  • 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01

2019-08-17

  • 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck

2019-08-15

  • 15:32 jeh: upgraded jobutils debian package to 1.38 T229551
  • 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces

2019-08-13

  • 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
  • 13:41 jeh: Set icinga downtime for toolschecker labs showmount T229448

2019-08-12

  • 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes (T230147)

2019-08-08

  • 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 T230157

2019-08-07

  • 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi (T229713)

2019-08-06

  • 16:18 arturo: add phamhi as user/projectadmin (T228942) and delete hpham
  • 15:59 arturo: add hpham as user/projectadmin (T228942)
  • 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts T221301

2019-08-05

  • 22:49 bstorm_: launching tools-worker-1040
  • 20:36 andrewbogott: rebooting oom tools-worker-1026
  • 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` T229846
  • 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again (T229787)
  • 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` (T229787)

2019-08-02

  • 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive

2019-07-31

  • 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
  • 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
  • 17:32 bstorm_: drained tools-worker-1028 to rebalance load
  • 17:29 bstorm_: drained tools-worker-1008 to rebalance load
  • 17:23 bstorm_: drained tools-worker-1021 to rebalance load
  • 17:17 bstorm_: drained tools-worker-1007 to rebalance load
  • 17:07 bstorm_: drained tools-worker-1004 to rebalance load
  • 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
  • 15:33 bstorm_: T228573 spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)

2019-07-27

2019-07-26

  • 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
  • 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
  • 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
  • 16:32 bstorm_: created tools-worker-1034 - T228573
  • 15:57 bstorm_: created tools-worker-1032 and 1033 - T228573
  • 15:55 bstorm_: created tools-worker-1031 - T228573

2019-07-25

  • 22:01 bstorm_: T228573 created tools-worker-1030
  • 21:22 jeh: rebooting tools-worker-1016 unresponsive

2019-07-24

  • 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 (T227539)
  • 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 (T227539)

2019-07-22

  • 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
  • 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
  • 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
  • 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
  • 17:55 bstorm_: draining tools-worker-1023 since it is having issues
  • 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats T228573

2019-07-20

  • 19:52 andrewbogott: rebooting tools-worker-1023

2019-07-17

  • 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014

2019-07-15

  • 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job 5190035

2019-06-25

  • 09:30 arturo: detected puppet issue in all VMs: T226480

2019-06-24

  • 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015

2019-06-17

  • 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
  • 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853 )

2019-06-11

  • 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs

2019-06-05

  • 18:33 andrewbogott: repooled tools-sgeexec-0921 and tools-sgeexec-0929
  • 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929

2019-05-30

  • 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
  • 13:01 arturo: reboot tools-worker-1003 to clean up the sssd config and let nslcd/nscd start fresh
  • 12:47 arturo: reboot tools-worker-1002 to clean up the sssd config and let nslcd/nscd start fresh
  • 12:42 arturo: reboot tools-worker-1001 to clean up the sssd config and let nslcd/nscd start fresh
  • 12:35 arturo: enable puppet in tools-worker nodes
  • 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)
  • 12:25 arturo: cordon/drain tools-worker-1002 because T224651 and T224651
  • 12:23 arturo: cordon/drain tools-worker-1001 because T224651 and T224651
  • 12:22 arturo: cordon/drain tools-worker-1029 because T224651 and T224651
  • 12:20 arturo: cordon/drain tools-worker-1003 because T224651 and T224651
  • 11:59 arturo: T224558 repool tools-worker-1003 (using sssd/sudo now!)
  • 11:23 arturo: T224558 depool tools-worker-1003
  • 10:48 arturo: T224558 drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
  • 10:33 arturo: T224558 switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:28 arturo: T224558 use hiera config in prefix tools-worker for sssd/sudo
  • 10:27 arturo: T224558 switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:09 arturo: T224558 disable puppet in all tools-worker- nodes
  • 10:01 arturo: T224558 add tools-worker-1029 to the nodes pool of k8s
  • 09:58 arturo: T224558 reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie

2019-05-29

  • 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
  • 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes (T221225)
  • 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
  • 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning

2019-05-28

  • 18:15 arturo: T221225 for the record, tools-worker-1001 is not working after trying with sssd
  • 18:13 arturo: T221225 created tools-worker-1029 to test sssd/sudo stuff
  • 17:49 arturo: T221225 repool tools-worker-1002 (using nscd/nslcd and sudoldap)
  • 17:44 arturo: T221225 back to classic/ldap hiera config in the tools-worker puppet prefix
  • 17:35 arturo: T221225 hard reboot tools-worker-1001 again
  • 17:27 arturo: T221225 hard reboot tools-worker-1001
  • 17:12 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1002
  • 17:09 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1001
  • 17:08 arturo: T221225 switch to sssd/sudo in puppet prefix for tools-worker
  • 13:04 arturo: T221225 depool and rebooted tools-worker-1001 in preparation for sssd migration
  • 12:39 arturo: T221225 disable puppet in all tools-worker nodes in preparation for sssd
  • 12:32 arturo: drop the tools-bastion puppet prefix, unused
  • 12:31 arturo: T221225 set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
  • 12:27 arturo: T221225 set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
  • 12:16 arturo: T221225 set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
  • 11:26 arturo: merged change to the sudo module to allow sssd transition

2019-05-27

  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%

2019-05-21

  • 12:35 arturo: T223992 rebooting tools-redis-1002

2019-05-20

  • 11:25 arturo: T223332 enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
  • 10:53 arturo: T223332 disable puppet agent in tools-k8s-master and tools-docker-registry nodes

2019-05-18

  • 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image (T217908)
  • 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45

2019-05-17

  • 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
  • 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
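
A toollabs-webservice release is followed by a rebuild of the container images so they ship the new package, after which the images are pushed to the project registry. A rough sketch of what one image rebuild looks like, assuming it runs on the docker builder host; the image name and source path are purely illustrative:

    # on the docker builder host; image name and path are examples only
    cd /srv/images/example-image        # hypothetical checkout of the image definition
    docker build -t docker-registry.tools.wmflabs.org/example-image:latest .
    docker push docker-registry.tools.wmflabs.org/example-image:latest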

2019-05-16

  • 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
  • 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as the busiest time

2019-05-15

  • 16:20 arturo: T223148 repool both tools-sgeexec-0921 and -0929
  • 15:32 arturo: T223148 depool tools-sgeexec-0921 and move to cloudvirt1014
  • 15:32 arturo: T223148 depool tools-sgeexec-0920 and move to cloudvirt1014
  • 12:29 arturo: T223148 repool both tools-sgeexec-09[37,39]
  • 12:13 arturo: T223148 depool tools-sgeexec-0937 and move to cloudvirt1008
  • 12:13 arturo: T223148 depool tools-sgeexec-0939 and move to cloudvirt1007
  • 11:34 arturo: T223148 repool tools-sgeexec-0940
  • 11:20 arturo: T223148 depool tools-sgeexec-0940 and move to cloudvirt1006
  • 11:11 arturo: T223148 repool tools-sgeexec-0941
  • 10:46 arturo: T223148 depool tools-sgeexec-0941 and move to cloudvirt1005
  • 09:44 arturo: T223148 repool tools-sgeexec-0901
  • 09:00 arturo: T223148 depool tools-sgeexec-0901 and reallocate to cloudvirt1004
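
Each of these reallocations follows the same pattern: take the exec node out of the grid, migrate the instance to its target cloudvirt, then put it back into service. A minimal sketch using the exec-manage helper seen elsewhere in this log, assuming it runs on the grid master; the node name is an example:

    NODE=tools-sgeexec-0901.tools.eqiad.wmflabs   # example exec node

    # stop the grid from scheduling new jobs on the node
    sudo exec-manage depool "$NODE"

    # ... migrate the instance to the target cloudvirt via Horizon/OpenStack ...

    # return the node to service once it is back up
    sudo exec-manage repool "$NODE"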

2019-05-14

  • 17:12 arturo: T223148 repool tools-sgeexec-0920
  • 16:37 arturo: T223148 depool tools-sgeexec-0920 and reallocate to cloudvirt1003
  • 16:36 arturo: T223148 repool tools-sgeexec-0911
  • 15:56 arturo: T223148 depool tools-sgeexec-0911 and reallocate to cloudvirt1003
  • 15:52 arturo: T223148 repool tools-sgeexec-0909
  • 15:24 arturo: T223148 depool tools-sgeexec-0909 and reallocate to cloudvirt1002
  • 15:24 arturo: T223148 last SAL entry is bogus, please ignore (depool tools-worker-1009)
  • 15:23 arturo: T223148 depool tools-worker-1009
  • 15:13 arturo: T223148 repool tools-worker-1023
  • 13:16 arturo: T223148 repool tools-sgeexec-0942
  • 13:03 arturo: T223148 repool tools-sgewebgrid-generic-0904
  • 12:58 arturo: T223148 reallocating tools-worker-1023 to cloudvirt1001
  • 12:56 arturo: T223148 depool tools-worker-1023
  • 12:52 arturo: T223148 reallocating tools-sgeexec-0942 to cloudvirt1001
  • 12:50 arturo: T223148 depool tools-sgeexec-0942
  • 12:49 arturo: T223148 reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
  • 12:43 arturo: T223148 depool tools-sgewebgrid-generic-0904

2019-05-13

  • 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
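
A non-empty exim paniclog keeps monitoring complaining until the file is emptied, so this comes up regularly. A one-host sketch of the cleanup, assuming root access (the clush-based variant later in this log does the same across the whole fleet):

    # empty the paniclog so the alert clears
    sudo truncate -s 0 /var/log/exim4/paniclog

    # optionally restart the node exporter so the stale metric is dropped
    sudo service prometheus-node-exporter restart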

2019-05-07

  • 14:38 arturo: T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
  • 14:31 arturo: T222718 reboot tools-worker-1009 and 1022 after being drained
  • 14:28 arturo: k8s drain tools-worker-1009 and 1022
  • 11:46 arturo: T219362 enable puppet in tools-redis servers and use the new puppet role
  • 11:33 arturo: T219362 disable puppet in tools-redis servers for puppet code cleanup
  • 11:12 arturo: T219362 drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
  • 11:10 arturo: T219362 enable puppet in tools-static servers and use new puppet role
  • 11:01 arturo: T219362 disable puppet in tools-static servers for puppet code cleanup
  • 10:16 arturo: T219362 drop the `tools-webgrid-lighttpd` puppet prefix
  • 10:14 arturo: T219362 drop the `tools-webgrid-generic` puppet prefix
  • 10:06 arturo: T219362 drop the `tools-exec-1` puppet prefix

2019-05-06

  • 11:34 arturo: T221225 reenable puppet
  • 10:53 arturo: T221225 disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)

2019-05-03

  • 09:43 arturo: fixed puppet in tools-puppetdb-01 too
  • 09:39 arturo: puppet should now be fine across toolforge (except tools-puppetdb-01, which is WIP I think)
  • 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
  • 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
  • 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
  • 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
  • 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package

2019-04-30

  • 12:50 arturo: enable puppet in all servers T221225
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd (T221225)
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
  • 11:07 arturo: T221225 disable puppet in toolforge
  • 10:56 arturo: T221225 create tools-sgebastion-0test for more sssd tests

2019-04-29

  • 11:22 arturo: T221225 re-enable puppet agent in all toolforge servers
  • 10:27 arturo: T221225 reboot tools-sgebastion-09 for testing sssd
  • 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test T221225
  • 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages

2019-04-26

  • 12:20 andrewbogott: rescheduling every pod everywhere
  • 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs

2019-04-25

  • 12:49 arturo: T221225 using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
  • 11:43 arturo: T221793 removing prometheus crontab and letting puppet agent re-create it to resolve staleness

2019-04-24

  • 12:54 arturo: puppet broken, fixing right now
  • 09:18 arturo: T221225 reallocating tools-sgebastion-09 to cloudvirt1008

2019-04-23

  • 15:26 arturo: T221225 rebooting tools-sgebastion-08 to cleanup sssd
  • 15:19 arturo: T221225 creating tools-sgebastion-09 for testing sssd stuff
  • 13:06 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
  • 12:57 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
  • 10:28 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
  • 10:27 arturo: T221225 rebooting tools-sgebastion-07 to clean sssd configuration
  • 10:16 arturo: T221225 disable puppet in tools-sgebastion-08 for sssd testing
  • 09:49 arturo: T221225 run puppet agent in the bastions and reboot them with sssd
  • 09:43 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
  • 09:41 arturo: T221225 disable puppet agent in the bastions
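
This section shows the general shape of the sssd migration used throughout this log: disable puppet on the affected hosts, flip the profile::ldap::client::labs::client_stack hiera key on the instance prefix in Horizon, then re-enable puppet, run the agent and reboot so the new stack starts cleanly. A rough command-side sketch using the clush groups seen elsewhere in this log; the hiera edit itself happens in Horizon, not on the command line:

    # from the clush master host
    clush -w @bastion "sudo puppet agent --disable 'sssd rollout'"

    # ... set profile::ldap::client::labs::client_stack: sssd on the prefix in Horizon ...

    clush -w @bastion "sudo puppet agent --enable"
    clush -w @bastion "sudo puppet agent --test"   # non-zero exit when changes are applied is expected
    clush -w @bastion "sudo reboot"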

2019-04-17

  • 12:09 arturo: T221225 rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
  • 11:59 arturo: T221205 sssd was deployed successfully into all webgrid nodes
  • 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
  • 11:31 arturo: reboot bastions for sssd deployment
  • 11:30 arturo: deploy sssd to bastions
  • 11:24 arturo: disable puppet in bastions to deploy sssd
  • 09:52 arturo: T221205 tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
  • 09:45 arturo: T221205 tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
  • 09:12 arturo: T221205 start deploying sssd to sgewebgrid nodes
  • 09:00 arturo: T221205 add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
  • 08:57 arturo: T221205 disable puppet in all tools-sgewebgrid-* nodes

2019-04-16

  • 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
  • 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
  • 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r

2019-04-15

  • 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
  • 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r

2019-04-14

  • 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them

2019-04-13

  • 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853
  • 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for T220853
  • 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853
  • 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

2019-04-11

  • 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
  • 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
  • 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
  • 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
  • 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
  • 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
  • 15:40 andrewbogott: moving tools-redis-1002 to eqiad1-r
  • 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
  • 12:01 arturo: T151704 deploying oidentd
  • 11:54 arturo: disable puppet in all hosts to deploy oidentd
  • 02:33 andrewbogott: tools-paws-worker-1005, tools-paws-worker-1006 to eqiad1-r
  • 00:03 andrewbogott: tools-paws-worker-1002, tools-paws-worker-1003 to eqiad1-r

2019-04-10

  • 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
  • 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
  • 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
  • 14:49 bstorm_: cleared E state from 5 queues
  • 13:06 arturo: T218126 hard reboot tools-sgeexec-0906
  • 12:31 arturo: T218126 hard reboot tools-sgeexec-0926
  • 12:27 arturo: T218126 hard reboot tools-sgeexec-0925
  • 12:06 arturo: T218126 hard reboot tools-sgeexec-0901
  • 11:55 arturo: T218126 hard reboot tools-sgeexec-0924
  • 11:47 arturo: T218126 hard reboot tools-sgeexec-0921
  • 11:23 arturo: T218126 hard reboot tools-sgeexec-0940
  • 11:03 arturo: T218126 hard reboot tools-sgeexec-0928
  • 10:49 arturo: T218126 hard reboot tools-sgeexec-0923
  • 10:43 arturo: T218126 hard reboot tools-sgeexec-0915
  • 10:27 arturo: T218126 hard reboot tools-sgeexec-0935
  • 10:19 arturo: T218126 hard reboot tools-sgeexec-0914
  • 10:02 arturo: T218126 hard reboot tools-sgeexec-0907
  • 09:41 arturo: T218126 hard reboot tools-sgeexec-0918
  • 09:27 arturo: T218126 hard reboot tools-sgeexec-0932
  • 09:26 arturo: T218216 hard reboot tools-sgeexec-0932
  • 09:04 arturo: T218216 add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
  • 09:03 arturo: T218216 do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
  • 08:39 arturo: T218216 disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
  • 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r

2019-04-09

  • 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
  • 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
  • 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
  • 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
  • 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
  • 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
  • 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
  • 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
  • 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
  • 17:05 andrewbogott: migrating tools-k8s-etcd-01 to eqiad1-r
  • 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
  • 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
  • 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
  • 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
  • 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get the k8s node moves to register

2019-04-08

  • 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
  • 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r

2019-04-07

  • 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
  • 01:06 bstorm_: cleared E state from 6 queues

2019-04-05

  • 15:44 bstorm_: cleared E state from two exec queues
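
Queue instances drop into the E (error) state when a job fails at startup, typically after an LDAP or NFS hiccup, and they stay out of service until the state is cleared by hand on the grid master. A minimal sketch; the queue instance name is an example:

    # show queue instances and explain any error states
    qstat -f -explain E

    # clear the error state on one queue instance...
    sudo qmod -c 'webgrid-lighttpd@tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs'

    # ...or on all queues at once
    sudo qmod -c '*'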

2019-04-04

  • 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
  • 20:53 bd808: Rebooting tools-worker-1013
  • 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
  • 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
  • 20:28 bd808: Shutdown tools-checker-01 via Horizon
  • 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
  • 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
  • 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
  • 20:03 bstorm_: depooled tools-webgrid-lighttpd-0912
  • 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
  • 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-update, and forced puppet run
  • 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
  • 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
  • 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
  • 19:13 bstorm_: cleared E state from 7 queues
  • 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host

2019-04-03

  • 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already

2019-04-02

  • 12:11 arturo: icinga downtime toolschecker for 1 month T219243
  • 03:55 bd808: Added etcd service group to tools-k8s-etcd-* (T219243)

2019-04-01

  • 19:44 bd808: Deleted tools-checker-02 via Horizon (T219243)
  • 19:43 bd808: Shutdown tools-checker-02 via Horizon (T219243)
  • 16:53 bstorm_: cleared E state on 6 grid queues
  • 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)

2019-03-29

  • 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
  • 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 (T219243)
  • 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
  • 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker (T219243)
  • 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing (T219243)
  • 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier (T219243)
  • 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 sudo qmod -cj` on tools-sgegrid-master
  • 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
  • 17:11 bd808: Restarted nginx on tools-static-13
  • 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
  • 16:49 bstorm_: cleared E state from 21 queues
  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 13:54 andrewbogott: moving tools-static-13 to eqiad1-r

2019-03-28

  • 01:00 bstorm_: cleared error states from two queues
  • 00:23 bstorm_: T216060 created tools-sgewebgrid-generic-0901...again!

2019-03-27

  • 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue T219460
  • 14:45 bstorm_: cleared several "E" state queues
  • 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
  • 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
  • 12:15 arturo: T218126 `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)

2019-03-26

  • 22:00 gtirloni: downtimed toolschecker
  • 17:31 arturo: T218126 create VM instances tools-sssd-sgeexec-test-[12]
  • 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
  • 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org

2019-03-25

  • 21:21 bd808: All Trusty grid engine hosts shutdown and deleted (T217152)
  • 21:19 bd808: Deleted tools-grid-{master,shadow} (T217152)
  • 21:18 bd808: Deleted tools-webgrid-lighttpd-14* (T217152)
  • 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
  • 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
  • 20:51 bd808: Deleted tools-webgrid-generic-14* (T217152)
  • 20:49 bd808: Deleted tools-exec-143* (T217152)
  • 20:49 bd808: Deleted tools-exec-142* (T217152)
  • 20:48 bd808: Deleted tools-exec-141* (T217152)
  • 20:47 bd808: Deleted tools-exec-140* (T217152)
  • 20:43 bd808: Deleted tools-cron-01 (T217152)
  • 20:42 bd808: Deleted tools-bastion-0{2,3} (T217152)
  • 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
  • 19:59 bd808: Shutdown tools-exec-143* (T217152)
  • 19:51 bd808: Shutdown tools-exec-142* (T217152)
  • 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
  • 19:33 bd808: Shutdown tools-exec-141* (T217152)
  • 19:31 bd808: Shutdown tools-bastion-0{2,3} (T217152)
  • 19:19 bd808: Shutdown tools-exec-140* (T217152)
  • 19:12 bd808: Shutdown tools-webgrid-generic-14* (T217152)
  • 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* (T217152)
  • 18:53 bd808: Shutdown tools-grid-master (T217152)
  • 18:53 bd808: Shutdown tools-grid-shadow (T217152)
  • 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
  • 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
  • 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
  • 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs (T217152)
  • 15:27 bd808: Copied all crontab files still on tools-cron-01 to each tool's $HOME/crontab.trusty.save
  • 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} (T217152)
  • 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} (T217152)

2019-03-22

  • 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
  • 16:12 bstorm_: cleared errored out stretch grid queues
  • 15:56 bd808: Rebooting tools-static-12
  • 03:09 bstorm_: T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.
  • 02:31 bstorm_: T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
  • 02:09 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924
  • 00:39 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

2019-03-21

  • 23:28 bstorm_: T217280 depooled, reloaded and repooled tools-sgeexec-0938
  • 21:53 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
  • 21:51 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
  • 21:26 bstorm_: T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

2019-03-18

  • 18:43 bd808: Rebooting tools-static-12
  • 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01|07|10)` all else working
  • 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
  • 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
  • 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down
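
The `--grace-period=0 --force` deletion mentioned in the 18:41 entry is the usual way to remove pods the kubelet can no longer account for (stuck in Unknown or never leaving Terminating). A minimal sketch, with the pod and namespace names as placeholders:

    # find candidates first
    kubectl get pods --all-namespaces | grep Unknown

    # force-remove a pod the node can no longer report on (names are examples)
    kubectl delete pod example-pod-12345 -n example-tool --grace-period=0 --force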

2019-03-17

  • 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for T218494
  • 22:30 bd808: Investigating strange system state on tools-bastion-03.
  • 17:48 bstorm_: T218514 rebooting tools-worker-1009 and 1012
  • 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for T218514
  • 17:13 bstorm_: depooled and rebooting tools-worker-1018
  • 15:09 andrewbogott: running 'killall dpkg' and 'dpkg --configure -a' on all nodes to try to work around a race with initramfs

2019-03-16

  • 22:34 bstorm_: clearing errored out queues again

2019-03-15

  • 21:08 bstorm_: cleared error state on several queues T217280
  • 15:58 gtirloni: rebooted tools-clushmaster-02
  • 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - T130532
  • 14:32 mutante: tools-sgebastion-07 - generating locales for user request in T130532

2019-03-14

  • 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} (T217152)
  • 23:28 bd808: Deleted tools-bastion-05 (T217152)
  • 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
  • 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon (T217152)
  • 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} (T217152)
  • 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon (T217152)
  • 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon (T217152)
  • 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 (T218341)
  • 21:32 gtirloni: rebooted tools-exec-1020 (T218341)
  • 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)
  • 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled (T217152)
  • 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
  • 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
  • 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
  • 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
  • 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
  • 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
  • 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
  • 20:36 bd808: depooled and rebooted tools-sgeexec-0908
  • 19:08 gtirloni: rebooted tools-worker-1028 (T218341)
  • 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 (T218341)
  • 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
  • 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)

2019-03-13

  • 23:30 bd808: Rebuilding stretch Kubernetes images
  • 22:55 bd808: Rebuilding jessie Kubernetes images
  • 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
  • 17:10 bstorm_: rebooted cron server
  • 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
  • 12:33 arturo: reboot tools-sgebastion-08 (T215154)
  • 12:17 arturo: reboot tools-sgebastion-07 (T215154)
  • 11:53 arturo: enable puppet in tools-sgebastion-07 (T215154)
  • 11:20 arturo: disable puppet in tools-sgebastion-07 for testing T215154
  • 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
  • 04:59 bstorm_: disabled puppet for a little bit on tools-bastion-07
  • 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 (T217406)

2019-03-11

  • 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot (T218038)
  • 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI (T218038)
  • 15:42 bd808: Rebooting tools-sgegrid-master (T218038)
  • 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
  • 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

2019-03-10

  • 22:36 gtirloni: increased nscd group TTL from 60 to 300sec

2019-03-08

  • 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)
  • 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)

2019-03-07

  • 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
  • 04:15 bd808: Killed 3 orphan processes on Trusty grid
  • 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups (T217280)
  • 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch T217406
  • 00:38 zhuyifei1999_: published misctools 1.37 T217406
  • 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild T217406

2019-03-06

  • 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02

2019-03-04

  • 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for T217473
  • 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)

2019-03-03

  • 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412

2019-02-28

  • 19:36 zhuyifei1999_: built with debuild instead T217297
  • 19:08 zhuyifei1999_: test failures during build, see ticket
  • 18:55 zhuyifei1999_: start building jobutils 1.36 T217297
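
jobutils, like misctools further down this log, is a small Debian package that gets built with debuild and then pushed into the aptly repos. A rough sketch of a plain unsigned build, assuming the packaging source is already checked out; the path is an example:

    # on the package builder host, inside the package source tree
    cd ~/src/jobutils              # example checkout location
    debuild -us -uc -b             # build unsigned binary packages
    ls ../jobutils_1.36_all.deb    # the resulting .deb lands in the parent directory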

2019-02-27

  • 20:41 andrewbogott: restarting nginx on tools-checker-01
  • 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
  • 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test T176027
  • 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
  • 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon (T217152)
  • 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs (T217152)
  • 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs (T217152)

2019-02-26

  • 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
  • 19:01 gtirloni: pushed updated docker images
  • 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test

2019-02-25

  • 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
  • 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
  • 13:11 chicocvenancio: PAWS: Stopped AABot notebook pod T217010
  • 12:54 chicocvenancio: PAWS: Restarted Criscod notebook pod T217010
  • 12:21 chicocvenancio: PAWS: killed proxy and hub pods to try to get the proxy to see routes to the open notebook servers, to no avail. Restarted BernhardHumm's notebook pod T217010
  • 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} (T216988)
  • 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
  • 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
  • 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
  • 07:48 zhuyifei1999_: systemd stuck in D state. :(
  • 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
  • 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
  • 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.

2019-02-22

  • 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
  • 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
  • 15:13 gtirloni: shutdown tools-puppetmaster-01

2019-02-21

  • 09:59 gtirloni: upgraded all packages in all stretch nodes
  • 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
  • 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up

2019-02-20

  • 23:30 zhuyifei1999_: begin rebuilding all docker images T178601 T193646 T215683
  • 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
  • 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
  • 23:17 zhuyifei1999_: begin build new tools-webservice package T178601 T193646 T215683
  • 21:57 andrewbogott: moving tools-static-13 to a new virt host
  • 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
  • 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
  • 16:56 andrewbogott: moving tools-paws-worker-1003
  • 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
  • 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442

2019-02-19

  • 01:49 bd808: Revoked Toolforge project membership for user DannyS712 (T215092)

2019-02-18

  • 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
  • 20:22 gtirloni: enabled toolsdb monitoring in Icinga
  • 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
  • 18:50 chicocvenancio: moving paws back to toolsdb T216208
  • 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness

2019-02-17

  • 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
  • 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
  • 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever

2019-02-16

  • 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
  • 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
  • 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
  • 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
  • 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
  • 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
  • 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
  • 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
  • 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
  • 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct results
  • 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
  • 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
  • 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
  • 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
  • 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP

2019-02-14

  • 21:57 bd808: Deleted old tools-proxy-02 instance
  • 21:57 bd808: Deleted old tools-proxy-01 instance
  • 21:56 bd808: Deleted old tools-package-builder-01 instance
  • 20:57 andrewbogott: rebooting tools-worker-1005
  • 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
  • 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
  • 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
  • 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
  • 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
  • 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
  • 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
  • 17:35 arturo: T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
  • 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r

2019-02-13

  • 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
  • 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 13:03 arturo: T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07

2019-02-12

  • 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)

2019-02-11

  • 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
  • 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
  • 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
  • 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
  • 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
  • 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
  • 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
  • 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
  • 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)
  • 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)
  • 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)
  • 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)
  • 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)
  • 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)
  • 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)
  • 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1

2019-02-08

  • 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools`, which is also unavailable
  • 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for T210829.
  • 13:49 gtirloni: upgraded all packages in SGE cluster
  • 12:25 arturo: install aptitude in tools-sgebastion-06
  • 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - T215272
  • 01:07 bd808: Creating tools-sgebastion-07

2019-02-07

  • 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
  • 20:18 gtirloni: cleared mail queue on tools-mail-02
  • 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - T215272

2019-02-04

  • 13:20 arturo: T215154 another reboot for tools-sgebastion-06
  • 12:26 arturo: T215154 another reboot for tools-sgebastion-06. Puppet is disabled
  • 11:38 arturo: T215154 reboot tools-sgebastion-06 to totally refresh systemd status
  • 11:36 arturo: T215154 manually install systemd 239 in tools-sgebastion-06

2019-01-30

  • 23:54 gtirloni: cleared apt cache on sge* hosts

2019-01-25

  • 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch (T214668)
  • 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for T214447
  • 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for T214447

2019-01-24

  • 11:09 arturo: T213421 delete tools-services-01/02
  • 09:46 arturo: T213418 delete tools-docker-registry-02
  • 09:45 arturo: T213418 delete tools-docker-builder-05 and tools-docker-registry-01
  • 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01

2019-01-23

  • 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image (T214519)
  • 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image (T214519)
  • 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance (T214519)
  • 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon (T214519)
  • 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
  • 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 (T211684)

2019-01-22

  • 20:21 gtirloni: published new docker images (all)
  • 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs

2019-01-21

  • 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet

2019-01-18

  • 21:22 bd808: Forcing php-igbinary update via clush for T213666

2019-01-17

  • 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
  • 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
  • 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
  • 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
  • 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
  • 17:16 arturo: T213421 shutdown tools-services-01/02. Will delete VMs after a grace period
  • 12:54 arturo: add webservice security group to tools-sge-services-03/04

2019-01-16

  • 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
  • 16:38 arturo: T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
  • 14:34 arturo: T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
  • 14:24 arturo: T213418 allocate floating IPs for tools-docker-registry-03 & 04

2019-01-15

  • 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
  • 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
  • 18:29 bstorm_: T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
  • 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
  • 14:21 arturo: T213418 put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`

2019-01-14

  • 22:03 bstorm_: T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
  • 22:03 bstorm_: T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
  • 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
  • 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
  • 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
  • 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
  • 16:44 arturo: T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
  • 14:00 arturo: T213421 disable updatetools in the new services nodes while building them
  • 13:53 arturo: T213421 delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
  • 13:47 arturo: T213421 create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`

2019-01-11

  • 11:55 arturo: T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM
  • 10:51 arturo: T213418 created tools-docker-builder-06 in eqiad1
  • 10:46 arturo: T213418 migrating tools-docker-registry-02 from eqiad to eqiad1

2019-01-10

  • 22:45 bstorm_: T213357 - Added 24 lighttpd nodes to the new grid
  • 18:54 bstorm_: T213355 built and configured two more generic web nodes for the new grid
  • 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
  • 00:12 bstorm_: T213353 Added 36 exec nodes to the new grid

2019-01-09

  • 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
  • 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
  • 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
  • 09:59 gtirloni: rebooted tools-checker-01 (T213252)

2019-01-07

  • 17:21 bstorm_: T67777 - set the max_u_jobs global grid config setting to 50 in the new grid
  • 15:54 bstorm_: T67777 Set stretch grid user job limit to 16
  • 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
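
The max_u_jobs entries above cap how many jobs a single user may have on the new grid at once; the value lives in the grid engine global configuration on the grid master. A sketch of inspecting and changing it with qconf (the edit itself is interactive):

    # show the current per-user job cap in the global configuration
    qconf -sconf | grep max_u_jobs

    # open the global configuration in an editor to change it
    sudo qconf -mconf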

2019-01-06

  • 22:06 bd808: Added floating ip to tools-sgebastion-06 (T212360)

2019-01-05

  • 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.

2019-01-04

  • 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history

2019-01-03

  • 21:03 bd808: Enabled Puppet on tools-proxy-02
  • 20:53 bd808: Disabled Puppet on tools-proxy-02
  • 20:51 bd808: Enabled Puppet on tools-proxy-01
  • 20:49 bd808: Disabled Puppet on tools-proxy-01

2018-12-21

  • 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
  • 16:01 andrewbogott: moving tools-grid-master to labvirt1004
  • 00:35 bd808: Installed tools-manifest 0.14 for T212390
  • 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for T212390
  • 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for T212390
  • 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390

2018-12-20

  • 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
  • 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
  • 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002

2018-12-17

  • 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - T212153
  • 19:18 gtirloni: decreased nfs-mount-manager verbosity (T211817)
  • 19:02 arturo: T211977 add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
  • 13:46 arturo: T211977 `aborrero@tools-services-01:~$ sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
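
tools-manifest, misctools and jobutils are all served from the aptly repositories on the services node: a new .deb is added to the right repo and the published copy is refreshed. A minimal sketch mirroring the entries above; the distribution name passed to publish is an assumption about how the repo was originally published:

    # on the services node holding the aptly repos
    sudo aptly repo add stretch-tools tools-manifest_0.13_all.deb
    sudo aptly repo add stretch-toolsbeta tools-manifest_0.13_all.deb

    # refresh the published repository so clients can see the new version
    sudo aptly publish update stretch-tools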

2018-12-11

  • 13:19 gtirloni: Removed BigBrother (T208357)

2018-12-05

  • 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster (T196973)

2018-12-04

  • 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage T164123
  • 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 (T164123)

2018-12-01

  • 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 (T194615)
  • 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts

2018-11-30

  • 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
  • 22:18 gtirloni: Pushed new jdk8 docker image based on stretch (T205774)
  • 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance (T194615)

2018-11-27

  • 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb

2018-11-26

  • 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
  • 17:34 gtirloni: T186571 removed legofan4000 user from project-tools group (again)
  • 13:31 gtirloni: deleted instance tools-clushmaster-01 (T209701)

2018-11-20

  • 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
  • 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
  • 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
  • 10:52 arturo: T208579 now distributing misctools and jobutils 1.33 in all aptly repos
  • 09:43 godog: restart prometheus@tools on prometheus-01

2018-11-16

  • 21:16 bd808: Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
  • 17:47 gtirloni: deleted tools-mail instance
  • 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
  • 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
  • 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades

2018-11-14

  • 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
  • 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
  • 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009

2018-11-13

  • 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
  • 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
  • 13:29 gtirloni: Changed active mail relay to tools-mail-02 (T209356)
  • 13:22 arturo: T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
  • 13:05 arturo: T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
  • 12:59 arturo: the puppet issue has been solved by reverting the code
  • 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit

2018-11-08

  • 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
  • 17:58 arturo: installing jobutils and misctools v1.32 (T207970)
  • 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
  • 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
  • 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
  • 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
  • 11:32 gtirloni: removed temporary /var/mail fix (T208843)

2018-11-07

  • 10:37 gtirloni: removed invalid apt.conf.d file from all hosts (T110055)

2018-11-02

  • 18:11 arturo: T206223 some disturbances due to the certificate renewal
  • 17:04 arturo: renewing *.wmflabs.org T206223

2018-10-31

  • 18:02 gtirloni: truncated big .err and error.log files
  • 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde

2018-10-29

  • 17:00 bd808: Ran grid engine orphan process kill script from T153281

2018-10-26

  • 10:34 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
  • 10:32 arturo: T209970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo

2018-10-19

  • 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
  • 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017

2018-10-18

  • 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017

2018-10-16

  • 15:13 bd808: (repost for gtirloni) T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename)

2018-10-07

  • 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 T194859
  • 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
  • 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens

2018-09-21

  • 12:35 arturo: clean up stale apt preference files (pinning) in tools-clushmaster-01
  • 12:14 arturo: T205078 same for {jessie,stretch}-wikimedia
  • 12:12 arturo: T205078 upgrade trusty-wikimedia packages (git-fat, debmonitor)
  • 11:57 arturo: T205078 purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines

2018-09-17

  • 09:13 arturo: T204481 aborrero@tools-mail:~$ sudo exiqgrep -i | xargs sudo exim -Mrm

2018-09-14

  • 11:22 arturo: T204267 stop the corhist tool (k8s) because it is hammering the wikidata API
  • 10:51 arturo: T204267 stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API

2018-09-08

  • 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog (T196137)

2018-09-07

  • 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb

2018-08-27

  • 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
  • 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
  • 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` T202932

2018-08-22

  • 13:02 arturo: I used this command: `sudo exim -bp | sudo exiqgrep -i | xargs sudo exim -Mrm`
  • 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com

2018-08-13

  • 23:31 legoktm: rebuilding docker images for webservice upgrade
  • 23:16 legoktm: published toollabs-webservice_0.41_all.deb
  • 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice

2018-08-09

  • 10:40 arturo: T201602 upgrade packages from jessie-backports (excluding python-designateclient)
  • 10:30 arturo: T201602 upgrade packages from jessie-wikimedia
  • 10:27 arturo: T201602 upgrade packages from trusty-updates

2018-08-08

  • 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244

2018-08-06

  • 12:33 arturo: T197176 installing texlive-full in toolforge

2018-08-01

  • 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break

2018-07-30

  • 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
  • 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools

2018-07-27

  • 04:52 zhuyifei1999_: rebuilding python/base docker container T190274

2018-07-25

  • 19:02 chasemp: tools-worker-1004 reboot
  • 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)

2018-07-18

  • 13:24 arturo: upgrading packages from `stretch-wikimedia` T199905
  • 13:18 arturo: upgrading packages from `stable` T199905
  • 12:51 arturo: upgrading packages from `oldstable` T199905
  • 12:31 arturo: upgrading packages from `trusty-updates` T199905
  • 12:16 arturo: upgrading packages from `jessie-wikimedia` T199905
  • 12:09 arturo: upgrading packages from `trusty-wikimedia` T199905

2018-06-30

  • 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
  • 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
  • 16:39 zhuyifei1999_: reboot tools-paws-master-01
  • 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
  • 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere

2018-06-29

  • 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
  • 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
  • 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070

2018-06-28

  • 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
  • 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
  • 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
  • 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
  • 16:48 arturo: rebooting tools-docker-registry-01
  • 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
  • 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck

2018-06-21

  • 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-20

  • 15:09 bd808: Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool

2018-06-14

  • 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-11

  • 10:11 arturo: T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`

2018-06-08

  • 07:46 arturo: T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes

2018-06-07

  • 11:01 arturo: T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`

2018-06-06

  • 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
  • 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
  • 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
  • 19:04 chasemp: tools-bastion-03 is virtually unusable
  • 09:49 arturo: T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
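
The scripted restarts logged between 20:25 and 22:00 walked every tool whose web pod was wedged in CrashLoopBackOff and bounced its webservice. A rough sketch of that kind of loop, run with root access; mapping the pod's namespace to the tool name is an assumption about the legacy cluster layout:

    # list tools with pods stuck in CrashLoopBackOff and restart each webservice
    kubectl get pods --all-namespaces | awk '/CrashLoopBackOff/ {print $1}' | sort -u |
    while read -r tool; do
        sudo -i -u "tools.${tool}" webservice restart
    done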

2018-06-05

  • 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben (T196486)
  • 17:39 arturo: T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
  • 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)

2018-06-04

  • 10:28 arturo: T196006 installing sqlite3 package in exec nodes

2018-06-03

  • 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs with names starting 'comm_delin' or 'delfilexcl' T195834

2018-05-31

2018-05-30

  • 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
  • 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
  • 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834

2018-05-28

  • 12:09 arturo: T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
  • 12:06 arturo: T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia

2018-05-25

  • 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
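
For context, sge_request holds default qsub options, one per line; the change above roughly corresponds to lines like the following (illustrative only, the real file contains more settings):

    -l h_vmem=512M
    -l release=trusty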

2018-05-22

2018-05-18

  • 16:36 bd808: Restarted bigbrother on tools-services-02

2018-05-16

  • 21:17 zhuyifei1999_: maintain-kubeusers stuck in infinite sleeps of 10 seconds

2018-05-15

  • 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
  • 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs (see the sketch after this day's entries)
  • 04:05 zhuyifei1999_: Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding
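
A minimal sketch of draining an unresponsive exec node (hypothetical: the exec-manage invocation is assumed from similar entries elsewhere in this log, and <jobid> is a placeholder):

    # stop new jobs from landing on the node
    qmod -d '*@tools-exec-1414.tools.eqiad.wmflabs'
    # see what is still registered on the node
    qhost -j -h tools-exec-1414.tools.eqiad.wmflabs
    # force-delete jobs stuck on the unresponsive host
    qdel -f <jobid>
    # take the node out of the pool before rebooting it
    sudo exec-manage depool tools-exec-1414.tools.eqiad.wmflabs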

2018-05-12

  • 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343

2018-05-11

  • 14:34 andrewbogott: repooling labvirt1001 tools instances
  • 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2018-05-10

  • 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update

2018-05-09

  • 21:11 Reedy: Added Tim Starling as member/admin

2018-05-07

  • 21:02 zhuyifei1999_: re-building all docker images T190893
  • 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 T190893
  • 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours

2018-05-05

  • 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing

2018-05-03

  • 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package T192566

2018-05-01

  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)

2018-04-27

  • 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
  • 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker

2018-04-23

  • 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools T192732

2018-04-22

  • 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -E " 1 " | grep php-cgi | xargs sudo kill -9'`

2018-04-15

  • 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] T192224
  • 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci T192224

2018-04-11

  • 13:25 chasemp: cleaned up frozen exim messages in an effort to relieve queue pressure

2018-04-06

  • 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
  • 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to T159254
  • 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for T159254

2018-04-05

  • 18:46 chicocvenancio: killed wget that was hogging io

2018-03-29

  • 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
  • 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done

2018-03-28

  • 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid

2018-03-26

  • 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-23

2018-03-22

  • 22:04 bd808: Forced puppet run on tools-proxy-02 for T130748
  • 21:52 bd808: Forced puppet run on tools-proxy-01 for T130748
  • 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
  • 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-21

  • 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
  • 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid (T190185)

2018-03-20

  • 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126
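
A hypothetical clush one-liner for the "can someone clush this?" request above (the mount point name is assumed from similar dumps-related entries elsewhere in this log):

    clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org && sudo mount -a'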

2018-03-19

  • 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools

2018-03-16

  • 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
  • 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp

2018-03-15

  • 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot T185624

2018-03-14

  • 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 (T181531)
  • 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 (T181531)
  • 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 (T181531)
  • 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
  • 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
  • 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full

2018-03-12

  • 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
  • 17:13 arturo: T188994 upgrading packages from `stable`
  • 16:53 arturo: T188994 upgrading packages from stretch-wikimedia
  • 16:33 arturo: T188994 upgrading packages from jessie-wikimedia
  • 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 5f3561e T189430
  • 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
  • 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
  • 13:19 arturo: T188994 upgrade packages from jessie-backports in all jessie servers
  • 12:49 arturo: T188994 upgrade packages from trusty-updates in all ubuntu servers
  • 12:34 arturo: T188994 upgrade packages from trusty-wikimedia in all ubuntu servers

2018-03-08

  • 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
  • 14:02 arturo: T188994 upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server

2018-03-07

2018-03-06

  • 16:15 madhuvishy: Reboot tools-docker-registry-02 T189018
  • 15:50 madhuvishy: Rebooting tools-worker-1011
  • 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
  • 15:03 arturo: drain and reboot tools-worker-1011
  • 15:03 chasemp: rebooted tools-worker 1001-1008
  • 14:58 arturo: drain and reboot tools-worker-1010
  • 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
  • 14:27 chasemp: reboot tools-worker-100[12]
  • 14:23 chasemp: downtime icinga alert for k8s workers ready
  • 13:21 arturo: T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
  • 12:58 arturo: T188994 upgrading packages in jessie nodes from the oldstable source
  • 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
  • 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this on canary servers last week and it went fine, so now running it fleet-wide
  • 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
  • 11:33 arturo: removing unused kernel packages in ubuntu nodes
  • 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster

2018-03-05

  • 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
  • 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb T167026 T181492
  • 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for T188911
  • 14:01 arturo: deleting old kernel packages in jessie instances for T188911
  • 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
  • 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for T187193
  • 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for T187193

2018-03-02

  • 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon

2018-03-01

2018-02-27

  • 17:37 chasemp: add chico as admin to toolsbeta
  • 12:23 arturo: running `apt-get autoclean` in canary servers
  • 12:16 arturo: running `apt-get autoremove` in canary servers

2018-02-26

  • 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
  • 10:35 arturo: enable puppet in tools-proxy-01
  • 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests

2018-02-25

  • 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals

2018-02-23

  • 19:11 arturo: enable puppet in tools-proxy-01
  • 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
  • 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
  • 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded

2018-02-22

  • 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server

2018-02-21

  • 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
  • 18:15 arturo: puppet should be fine across the fleet
  • 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
  • 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
  • 16:59 arturo: puppet is broken across the cluster due to last change
  • 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
  • 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
  • 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
  • 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
  • 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tools-logs-02
  • 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
  • 09:18 chicocvenancio: killed io intensive tool job in bastion
  • 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, which leaks the creds of /data/project/strephit/.elasticsearch.ini. Might need to cycle it as well...

2018-02-20

  • 12:42 arturo: upgrading tools-flannel-etcd-01
  • 12:42 arturo: upgrading tools-k8s-etcd-01

2018-02-19

  • 19:13 arturo: upgrade all packages of tools-services-01
  • 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
  • 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
  • 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration

2018-02-16

  • 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
  • 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
  • 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
  • 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
  • 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
  • 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
  • 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
  • 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y

2018-02-15

  • 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for T187435
  • 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
  • 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
  • 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
  • 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia

2018-02-14

  • 13:09 arturo: the reboot was OK, the server seems to be working and kubectl sees all the pods running in the deployment (T187315)
  • 13:04 arturo: reboot tools-paws-master-01 for T187315

2018-02-11

  • 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
  • 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775

2018-02-09

  • 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ T179343 T182562 T186846
  • 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
  • 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
  • 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
  • 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
  • 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
  • 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (T186830)
  • 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there

2018-02-08

  • 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
  • 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
  • 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
  • 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
  • 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
  • 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
  • 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
  • 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
  • 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
  • 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
  • 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
  • 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
  • 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
  • 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.

2018-02-06

  • 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
  • 13:05 arturo: unpublish/publish trusty-tools repo
  • 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for T186539 after adding it to trusty-tools repo (self contained)

2018-02-05

  • 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address T186539
  • 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
  • 13:06 arturo: deploying fix for T186230 using clush

2018-02-03

  • 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools python3 ./broken_ref_anchors.py"

2018-01-31

  • 22:54 chasemp: add bstorm to sudoers as root

2018-01-29

  • 20:02 chasemp: add zhuyifei1999_ tools root for T185577
  • 20:01 chasemp: blast a puppet run to see if any errors are persistent

2018-01-28

  • 22:49 chicocvenancio: killed compromised session generating miner processes
  • 22:48 chicocvenancio: killed miner processes in tools-bastion-03

2018-01-27

  • 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
  • 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive

2018-01-25

  • 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing configtimeout with http_configtimeout by hand in /etc/puppet/puppet.conf
  • 23:20 arturo: T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 05:25 arturo: deploying misctools and jobutils 1.29 for T179386

2018-01-23

  • 19:41 madhuvishy: Add bstorm to project admins
  • 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
  • 14:17 chasemp: add me, arturo, chico to sudoers and removed marc

2018-01-22

  • 18:32 arturo: T181948 T185314 deploying jobutils and misctools v1.28 in the cluster
  • 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
  • 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how the cluster is doing with puppet
  • 10:18 arturo: T181948 deploy misctools 1.27 in the cluster

2018-01-19

  • 17:32 arturo: T185314 deploying new version of jobutils 1.27
  • 12:56 arturo: the puppet status across the fleet seems good, only minor things like T185314 , T179388 and T179386
  • 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'

2018-01-18

  • 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to T182781)
  • 15:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 13:52 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter | grep lsbdistcodename | grep trusty && sudo apt-upgrade trusty-wikimedia -v'
  • 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
  • 12:24 arturo: T178717 aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
  • 12:11 arturo: T178717 aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
  • 11:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-17

  • 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
  • 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
  • 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes (see the sketch after this day's entries)
  • 15:15 andrewbogott: repooling tools-exec-1430 via exec-manage.
  • 15:04 andrewbogott: depooling tools-exec-1430 via exec-manage. Experimenting with purge-old-kernels
  • 14:09 arturo: T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
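
A minimal sketch of the per-node kernel cleanup pattern described above (hypothetical: the exact purge-old-kernels flags and the exec-manage invocation are assumptions):

    # take the node out of the grid, purge old kernel packages, put it back
    sudo exec-manage depool tools-exec-1430.tools.eqiad.wmflabs
    sudo purge-old-kernels --keep 2 -qy
    sudo exec-manage repool tools-exec-1430.tools.eqiad.wmflabs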

2018-01-16

  • 22:01 chasemp: qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
  • 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
  • 21:24 andrewbogott: repooled tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 21:14 andrewbogott: depooling tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412 and tools-exec-1423 for host reboot
  • 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413 tools-exec-1442 for host reboot
  • 18:50 andrewbogott: switched active proxy back to tools-proxy-02
  • 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
  • 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
  • 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
  • 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
  • 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
  • 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
  • 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
  • 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
  • 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
  • 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
  • 13:35 chasemp: tools-mail almouked@ltnet.net 719 pending messages cleared

2018-01-11

  • 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
  • 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
  • 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 19:00 chasemp: reboot tools-worker-1015
  • 15:08 chasemp: reboot tools-exec-1405
  • 15:06 chasemp: reboot tools-exec-1404
  • 15:06 chasemp: reboot tools-exec-1403
  • 15:02 chasemp: reboot tools-exec-1402
  • 14:57 chasemp: reboot tools-exec-1401 again...
  • 14:53 chasemp: reboot tools-exec-1401
  • 14:46 chasemp: install Meltdown-patched kernel and reboot workers 1011-1016 as jessie pilot

2018-01-10

  • 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
  • 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
  • 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
  • 13:57 arturo: T184604 cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
  • 13:46 arturo: T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
  • 13:45 arturo: T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
  • 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
  • 13:22 arturo: emptied syslog and daemon.log by hand; they are so big that logrotate won't handle them
  • 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
  • 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604
  • 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604

2018-01-09

  • 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
  • 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
  • 23:01 yuvipanda: kill paws master and reboot it
  • 22:54 yuvipanda: kill all kube-system pods in paws cluster
  • 22:54 yuvipanda: kill all PAWS pods
  • 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
  • 22:49 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
  • 22:48 yuvipanda: run `clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'` to set up kubeadm on all paws worker nodes
  • 22:46 yuvipanda: reboot all paws-worker nodes
  • 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
  • 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
  • 20:55 chasemp: for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done
  • 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
  • 20:15 chasemp: disable puppet on proxies and k8s workers
  • 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
  • 19:42 chasemp: reboot tools-worker-1010

2018-01-08

  • 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
  • 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02

2018-01-06

  • 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
  • 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)

2018-01-05

  • 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
  • 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
  • 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
  • 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)

2018-01-04

  • 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of T184018

2018-01-03

Archives