Nova Resource:Tools/SAL
2019-02-12
- 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)
2019-02-11
- 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
- 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
- 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
- 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
- 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
- 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
- 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
- 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
- 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
- 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)
- 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)
- 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)
- 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)
- 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)
- 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)
- 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)
- 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
2019-02-08
- 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools` which is also unavailable
- 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for T210829.
- 13:49 gtirloni: upgraded all packages in SGE cluster
- 12:25 arturo: install aptitude in tools-sgebastion-06
- 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - T215272
- 01:07 bd808: Creating tools-sgebastion-07
2019-02-07
- 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
- 20:18 gtirloni: cleared mail queue on tools-mail-02
- 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - T215272
2019-02-04
- 13:20 arturo: T215154 another reboot for tools-sgebastion-06
- 12:26 arturo: T215154 another reboot for tools-sgebastion-06. Puppet is disabled
- 11:38 arturo: T215154 reboot tools-sgebastion-06 to totally refresh systemd status
- 11:36 arturo: T215154 manually install systemd 239 in tools-sgebastion-06
2019-01-30
- 23:54 gtirloni: cleared apt cache on sge* hosts
2019-01-25
- 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch (T214668)
- 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for T214447
- 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for T214447
2019-01-24
- 11:09 arturo: T213421 delete tools-services-01/02
- 09:46 arturo: T213418 delete tools-docker-registry-02
- 09:45 arturo: T213418 delete tools-docker-builder-05 and tools-docker-registry-01
- 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
2019-01-23
- 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image (T214519)
- 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image (T214519)
- 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance (T214519)
- 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon (T214519)
- 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
- 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 (T211684)
2019-01-22
- 20:21 gtirloni: published new docker images (all)
- 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
2019-01-21
- 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
2019-01-18
- 21:22 bd808: Forcing php-igbinary update via clush for T213666
2019-01-17
- 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
- 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
- 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
- 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
- 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
- 17:16 arturo: T213421 shutdown tools-services-01/02. Will delete VMs after a grace period
- 12:54 arturo: add webservice security group to tools-sge-services-03/04
2019-01-16
- 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
- 16:38 arturo: T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
- 14:34 arturo: T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
- 14:24 arturo: T213418 allocate floating IPs for tools-docker-registry-03 & 04
2019-01-15
- 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
- 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
- 18:29 bstorm_: T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
- 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
- 14:21 arturo: T213418 put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
2019-01-14
- 22:03 bstorm_: T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
- 22:03 bstorm_: T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
- 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
- 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
- 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
- 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
- 16:44 arturo: T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
- 14:00 arturo: T213421 disable updatetools in the new services nodes while building them
- 13:53 arturo: T213421 delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
- 13:47 arturo: T213421 create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
2019-01-11
- 11:55 arturo: T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM
- 10:51 arturo: T213418 created tools-docker-builder-06 in eqiad1
- 10:46 arturo: T213418 migrating tools-docker-registry-02 from eqiad to eqiad1
2019-01-10
- 22:45 bstorm_: T213357 - Added 24 lighttpd nodes to the new grid
- 18:54 bstorm_: T213355 built and configured two more generic web nodes for the new grid
- 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
- 00:12 bstorm_: T213353 Added 36 exec nodes to the new grid
2019-01-09
- 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
- 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
- 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
- 09:59 gtirloni: rebooted tools-checker-01 (T213252)
2019-01-07
- 17:21 bstorm_: T67777 - set the max_u_jobs global grid config setting to 50 in the new grid
- 15:54 bstorm_: T67777 Set stretch grid user job limit to 16
- 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
2019-01-06
- 22:06 bd808: Added floating ip to tools-sgebastion-06 (T212360)
2019-01-05
- 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
2019-01-04
- 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
2019-01-03
- 21:03 bd808: Enabled Puppet on tools-proxy-02
- 20:53 bd808: Disabled Puppet on tools-proxy-02
- 20:51 bd808: Enabled Puppet on tools-proxy-01
- 20:49 bd808: Disabled Puppet on tools-proxy-01
2018-12-21
- 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
- 16:01 andrewbogott: moving tools-grid-master to labvirt1004
- 00:35 bd808: Installed tools-manifest 0.14 for T212390
- 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for T212390
- 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for T212390
- 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390
2018-12-20
- 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
- 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
- 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002
2018-12-17
- 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - T212153
- 19:18 gtirloni: decreased nfs-mount-manager verbosity (T211817)
- 19:02 arturo: T211977 add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
- 13:46 arturo: T211977 `aborrero@tools-services-01:~$ sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
2018-12-11
- 13:19 gtirloni: Removed BigBrother (T208357)
2018-12-05
- 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster (T196973)
2018-12-04
- 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage T164123
- 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 (T164123)
2018-12-01
- 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 (T194615)
- 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
2018-11-30
- 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
- 22:18 gtirloni: Pushed new jdk8 docker image based on stretch (T205774)
- 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance (T194615)
2018-11-27
- 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
2018-11-26
- 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
- 17:34 gtirloni: T186571 removed legofan4000 user from project-tools group (again)
- 13:31 gtirloni: deleted instance tools-clushmaster-01 (T209701)
2018-11-20
- 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
- 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
- 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
- 10:52 arturo: T208579 distributing now misctools and jobutils 1.33 in all aptly repos
- 09:43 godog: restart prometheus@tools on prometheus-01
2018-11-16
- 21:16 bd808: Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
- 17:47 gtirloni: deleted tools-mail instance
- 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
- 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
- 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
2018-11-14
- 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
- 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
- 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009
2018-11-13
- 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
- 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
- 13:29 gtirloni: Changed active mail relay to tools-mail-02 (T209356)
- 13:22 arturo: T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
- 13:05 arturo: T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
- 12:59 arturo: the puppet issue has been solved by reverting the code
- 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
2018-11-08
- 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
- 17:58 arturo: installing jobutils and misctools v1.32 (T207970)
- 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
- 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
- 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
- 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
- 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
- 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
- 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
- 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
- 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
- 11:32 gtirloni: removed temporary /var/mail fix (T208843)
2018-11-07
- 10:37 gtirloni: removed invalid apt.conf.d file from all hosts (T110055)
2018-11-02
- 18:11 arturo: T206223 some disturbances due to the certificate renewal
- 17:04 arturo: renewing *.wmflabs.org T206223
2018-10-31
- 18:02 gtirloni: truncated big .err and error.log files
- 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
2018-10-29
- 17:00 bd808: Ran grid engine orphan process kill script from T153281
2018-10-26
- 10:34 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
- 10:32 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
2018-10-19
- 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
- 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
2018-10-18
- 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
2018-10-16
- 15:13 bd808: (repost for gtirloni) T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename)
2018-10-07
- 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 T194859
- 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
- 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
2018-09-21
- 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
- 12:14 arturo: T205078 same for {jessie,stretch}-wikimedia
- 12:12 arturo: T205078 upgrade trusty-wikimedia packages (git-fat, debmonitor)
- 11:57 arturo: T205078 purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines
2018-09-17
- 09:13 arturo: T204481 aborrero@tools-mail:~$ sudo exiqgrep -i | xargs sudo exim -Mrm
2018-09-14
- 11:22 arturo: T204267 stop the corhist tool (k8s) because it is hammering the wikidata API
- 10:51 arturo: T204267 stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API
2018-09-08
- 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog (T196137)
2018-09-07
- 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
2018-08-27
- 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
- 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
- 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
2018-08-22
- 13:02 arturo: I used this command: `sudo exim -bp | sudo exiqgrep -i | xargs sudo exim -Mrm`
- 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
2018-08-19
- 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 (T202218)
2018-08-14
- 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
- 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
2018-08-13
- 23:31 legoktm: rebuilding docker images for webservice upgrade
- 23:16 legoktm: published toollabs-webservice_0.41_all.deb
- 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice
2018-08-09
- 10:40 arturo: T201602 upgrade packages from jessie-backports (excluding python-designateclient)
- 10:30 arturo: T201602 upgrade packages from jessie-wikimedia
- 10:27 arturo: T201602 upgrade packages from trusty-updates
2018-08-08
- 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244
2018-08-06
- 12:33 arturo: T197176 installing texlive-full in toolforge
2018-08-01
- 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
2018-07-30
- 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
- 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
2018-07-27
- 04:52 zhuyifei1999_: rebuilding python/base docker container T190274
2018-07-25
- 19:02 chasemp: tools-worker-1004 reboot
- 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
2018-07-18
- 13:24 arturo: upgrading packages from `stretch-wikimedia` T199905
- 13:18 arturo: upgrading packages from `stable` T199905
- 12:51 arturo: upgrading packages from `oldstable` T199905
- 12:31 arturo: upgrading packages from `trusty-updates` T199905
- 12:16 arturo: upgrading packages from `jessie-wikimedia` T199905
- 12:09 arturo: upgrading packages from `trusty-wikimedia` T199905
2018-06-30
- 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
- 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
- 16:39 zhuyifei1999_: reboot tools-paws-master-01
- 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
- 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
2018-06-29
- 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
- 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
- 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070
2018-06-28
- 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
- 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
- 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
- 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
- 16:48 arturo: rebooting tools-docker-registry-01
- 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
- 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
2018-06-21
- 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-20
- 15:09 bd808: Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool
2018-06-14
- 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
2018-06-11
- 10:11 arturo: T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`
2018-06-08
- 07:46 arturo: T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
2018-06-07
- 11:01 arturo: T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
2018-06-06
- 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
- 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
- 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
- 19:04 chasemp: tools-bastion-03 is virtually unusable
- 09:49 arturo: T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
2018-06-05
- 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben (T196486)
- 17:39 arturo: T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
- 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)
2018-06-04
- 10:28 arturo: T196006 installing sqlite3 package in exec nodes
2018-06-03
- 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and tools.mbh that has a job name starting 'comm_delin', 'delfilexcl' T195834
2018-05-31
- 11:31 zhuyifei1999_: building & pushing python/web docker image T174769
- 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
2018-05-30
- 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
- 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
- 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834
2018-05-28
- 12:09 arturo: T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
- 12:06 arturo: T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia
2018-05-25
- 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558
2018-05-22
- 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for T194665 (mono framework update)
2018-05-18
- 16:36 bd808: Restarted bigbrother on tools-services-02
2018-05-16
- 21:17 zhuyifei1999_: maintain-kubeusers stuck in infinite sleeps of 10 seconds
2018-05-15
- 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
- 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
- 04:05 zhuyifei1999_: Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding
2018-05-12
- 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343
2018-05-11
- 14:34 andrewbogott: repooling labvirt1001 tools instances
- 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
2018-05-10
- 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
2018-05-09
- 21:11 Reedy: Added Tim Starling as member/admin
2018-05-07
- 21:02 zhuyifei1999_: re-building all docker images T190893
- 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 T190893
- 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
2018-05-05
- 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
2018-05-03
- 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package T192566
2018-05-01
- 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
2018-04-27
- 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
- 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
2018-04-23
- 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools T192732
2018-04-22
- 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -E " 1 " | grep php-cgi | xargs sudo kill -9'`
2018-04-15
- 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] T192224
- 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci T192224
2018-04-11
- 13:25 chasemp: cleanup exim frozen messages in an effort to alleviate queue pressure
2018-04-06
- 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
- 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to T159254
- 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for T159254
2018-04-05
- 18:46 chicocvenancio: killed wget that was hogging io
2018-03-29
- 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
- 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
2018-03-28
- 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
2018-03-26
- 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
2018-03-23
- 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
- 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
2018-03-22
- 22:04 bd808: Forced puppet run on tools-proxy-02 for T130748
- 21:52 bd808: Forced puppet run on tools-proxy-01 for T130748
- 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
- 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
2018-03-21
- 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
- 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid (T190185)
2018-03-20
- 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126
2018-03-19
- 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
2018-03-16
- 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
- 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
2018-03-15
- 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot T185624
2018-03-14
- 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 (T181531)
- 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 (T181531)
- 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 (T181531)
- 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
- 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
- 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
2018-03-12
- 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
- 17:13 arturo: T188994 upgrading packages from `stable`
- 16:53 arturo: T188994 upgrading packages from stretch-wikimedia
- 16:33 arturo: T188994 upgrading packages from jessie-wikimedia
- 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 5f3561e T189430
- 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
- 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
- 13:19 arturo: T188994 upgrade packages from jessie-backports in all jessie servers
- 12:49 arturo: T188994 upgrade packages from trusty-updates in all ubuntu servers
- 12:34 arturo: T188994 upgrade packages from trusty-wikimedia in all ubuntu servers
2018-03-08
- 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
- 14:02 arturo: T188994 upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
2018-03-07
- 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
- 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
- 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
- 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
2018-03-06
- 16:15 madhuvishy: Reboot tools-docker-registry-02 T189018
- 15:50 madhuvishy: Rebooting tools-worker-1011
- 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
- 15:03 arturo: drain and reboot tools-worker-1011
- 15:03 chasemp: rebooted tools-worker 1001-1008
- 14:58 arturo: drain and reboot tools-worker-1010
- 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
- 14:27 chasemp: reboot tools-worker-100[12]
- 14:23 chasemp: downtime icinga alert for k8s workers ready
- 13:21 arturo: T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
- 12:58 arturo: T188994 upgrading packages in jessie nodes from the oldstable source
- 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
- 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this on canary servers last week and it went fine, so now running it fleet-wide
- 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
- 11:33 arturo: removing unused kernel packages in ubuntu nodes
- 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
2018-03-05
- 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
- 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb T167026 T181492
- 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for T188911
- 14:01 arturo: deleting old kernel packages in jessie instances for T188911
- 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
- 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for T187193
- 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for T187193
2018-03-02
- 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
2018-03-01
- 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
2018-02-27
- 17:37 chasemp: add chico as admin to toolsbeta
- 12:23 arturo: running `apt-get autoclean` in canary servers
- 12:16 arturo: running `apt-get autoremove` in canary servers
2018-02-26
- 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
- 10:35 arturo: enable puppet in tools-proxy-01
- 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
2018-02-25
- 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
2018-02-23
- 19:11 arturo: enable puppet in tools-proxy-01
- 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
- 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
- 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
2018-02-22
- 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
2018-02-21
- 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
- 18:15 arturo: puppet should be fine across the fleet
- 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
- 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
- 16:59 arturo: puppet is broken across the cluster due to last change
- 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
- 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
- 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
- 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
- 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
- 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
- 09:18 chicocvenancio: killed io intensive tool job in bastion
- 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...
2018-02-20
- 12:42 arturo: upgrading tools-flannel-etcd-01
- 12:42 arturo: upgrading tools-k8s-etcd-01
2018-02-19
- 19:13 arturo: upgrade all packages of tools-services-01
- 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
- 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
- 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
2018-02-16
- 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
- 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
- 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
- 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
- 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
- 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
- 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
- 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
2018-02-15
- 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for T187435
- 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
- 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
- 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
- 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
- 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
- 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
2018-02-14
- 13:09 arturo: the reboot was OK, the server seems to be working and kubectl sees all the pods running in the deployment (T187315)
- 13:04 arturo: reboot tools-paws-master-01 for T187315
2018-02-11
- 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
- 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
2018-02-09
- 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ T179343 T182562 T186846
- 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
- 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
- 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
- 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
- 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
- 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (T186830)
- 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
2018-02-08
- 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
- 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
- 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
- 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
- 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
- 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
- 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
- 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
- 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
- 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
- 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
- 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
- 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
- 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
2018-02-06
- 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
- 13:05 arturo: unpublish/publish trusty-tools repo
- 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for T186539 after adding it to trusty-tools repo (self contained)
2018-02-05
- 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address T186539
- 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
- 13:06 arturo: deploying fix for T186230 using clush
2018-02-03
- 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools python3 ./broken_ref_anchors.py"
2018-01-31
- 22:54 chasemp: add bstorm to sudoers as root
2018-01-29
- 20:02 chasemp: add zhuyifei1999_ tools root for T185577
- 20:01 chasemp: blast a puppet run to see if any errors are persistent
2018-01-28
- 22:49 chicocvenancio: killed compromised session generating miner processes
- 22:48 chicocvenancio: killed miner processes in tools-bastion-03
2018-01-27
- 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
- 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
2018-01-25
- 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
- 23:20 arturo: T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
- 05:25 arturo: deploying misctools and jobutils 1.29 for T179386
2018-01-23
- 19:41 madhuvishy: Add bstorm to project admins
- 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
- 14:17 chasemp: add me, arturo, chico to sudoers and removed marc
2018-01-22
- 18:32 arturo: T181948 T185314 deploying jobutils and misctools v1.28 in the cluster
- 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
- 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
- 10:18 arturo: T181948 deploy misctools 1.27 in the cluster
2018-01-19
- 17:32 arturo: T185314 deploying new version of jobutils 1.27
- 12:56 arturo: the puppet status across the fleet seems good, only minor things like T185314, T179388 and T179386
- 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
2018-01-18
- 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to T182781)
- 15:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
- 13:52 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter | grep lsbdistcodename | grep trusty && sudo apt-upgrade trusty-wikimedia -v'
- 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
- 12:24 arturo: T178717 aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
- 12:11 arturo: T178717 aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
- 11:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
2018-01-17
- 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
- 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
- 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
- 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
- 15:04 andrewbogott: depooling exec-manage tools-exec-1430. Experimenting with purge-old-kernels
- 14:09 arturo: T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
2018-01-16
- 22:01 chasemp: qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
- 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
- 21:24 andrewbogott: repooled tools-exec-1420 and tools-webgrid-lighttpd-1417
- 21:14 andrewbogott: depooling tools-exec-1420 and tools-webgrid-lighttpd-1417
- 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
- 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
- 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
- 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
- 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
- 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412 and tools-exec-1423 for host reboot
- 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
- 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
- 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413 tools-exec-1442 for host reboot
- 18:50 andrewbogott: switched active proxy back to tools-proxy-02
- 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
- 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
- 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
- 18:26 andrewbogott: repooling tools-exec-1404 and 1434 after host reboot
- 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
- 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
- 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
- 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
- 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
- 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
- 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
- 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
- 13:35 chasemp: tools-mail almouked@ltnet.net 719 pending messages cleared
2018-01-11
- 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
- 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
- 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
- 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
- 19:00 chasemp: reboot tools-worker-1015
- 15:08 chasemp: reboot tools-exec-1405
- 15:06 chasemp: reboot tools-exec-1404
- 15:06 chasemp: reboot tools-exec-1403
- 15:02 chasemp: reboot tools-exec-1402
- 14:57 chasemp: reboot tools-exec-1401 again...
- 14:53 chasemp: reboot tools-exec-1401
- 14:46 chasemp: install meltdown kernel and reboot workers 1011-1016 as jessie pilot
2018-01-10
- 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
- 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
- 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
- 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
- 13:57 arturo: T184604 cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
- 13:46 arturo: T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
- 13:45 arturo: T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
- 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
- 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
- 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
- 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604
- 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604
2018-01-09
- 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
- 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
- 23:01 yuvipanda: kill paws master and reboot it
- 22:54 yuvipanda: kill all kube-system pods in paws cluster
- 22:54 yuvipanda: kill all PAWS pods
- 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
- 22:49 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
- 22:48 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash' to set up kubeadm on all paws worker nodes
- 22:46 yuvipanda: reboot all paws-worker nodes
- 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
- 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
- 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
- 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
- 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
- 20:55 chasemp: for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done
- 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
- 20:15 chasemp: disable puppet on proxies and k8s workers
- 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
- 19:42 chasemp: reboot tools-worker-1010
2018-01-08
- 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
- 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
2018-01-06
- 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
- 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
2018-01-05
- 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
- 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
- 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
- 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
- 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
- 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
- 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
- 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
- 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
- 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
- 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
2018-01-04
- 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of T184018
2018-01-03
- 15:38 bd808: Forced Puppet run on tools-services-01
- 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush