From Wikitech-static
Revision as of 14:42, 12 May 2022 by imported>Razzi (→‎Random notes)

Learning the Wikimedia stack! Trying to be the most versatile SRE at the Wikimedia Foundation! For now, learning data engineering tools: Kafka, Spark, Scala, MySQL. See grand plan here

<inputbox> type=create placeholder=Article name prefix=User:Razzi/ buttonlabel=Create user article </inputbox>
<inputbox> type=create prefix=User:Razzi/ default=2022-10-06 buttonlabel=Create article for day </inputbox>
<inputbox> prefix=User:Razzi/scratch type=commenttitle page=User:Razzi buttonlabel=New section on scratch page </inputbox>


How to install Java 8 on Debian (TODO)



What are the "unauthenticated user" users seen in show processlist, as in:

razzi@clouddb1014:~$ sudo mysql -S /var/run/mysqld/mysqld.s7.sock  -e 'show processlist;'

How does refine use salts?

Is /system a default directory for Hadoop, or can we remove it?

Is there a place that lists the VLANs?

How to check the VLAN for a host?

Q: Is it expected that when reimaging a host, we see the old name when running homer?

[edit interfaces interface-range disabled]
-    member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+    member ge-1/0/13;
     member ge-1/0/43 { ... }
[edit interfaces]
+   ge-1/0/13 {
+       description "db1125 {#2221}";
+   }

^ this is while decommissioning db1125

A: No, I skipped some netbox steps; when I fixed them this didn't show up

Q: How to submit a test job to the yarn queue to test if it is accepting jobs?

Q: What to do about this warning on analytics1068?

May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/ → /run/hadoop-yarn/; please update the unit file accordingly.

Q: Server Lifecycle#Rename while reimaging when to merge homer patch?

A: The homer patch is for the firewall and is unrelated to the reimaging process; merge it after the reimage is complete.

Q: What is the order for creating Puppet patches during the server lifecycle? Things to avoid: having a site.pp entry for a node that is being decommissioned, or a site.pp entry for a node that doesn't exist yet.


Script to show what tickets are currently in progress

Add homer-public to codesearch

Change method

    def get_runner(self, args):
        """As specified by Spicerack API."""
        return UpdateWikireplicaViewsRunner(args, self.spicerack)

to work for the 99% case:

runner = UpdateWikireplicaViewsRunner

by defining get_runner on UpdateWikireplicaViews
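A minimal sketch of that refactor. The class and method names below are illustrative, not the real Spicerack API: a shared base class defines get_runner once, and each cookbook only declares which runner class it uses.

```python
class CookbookBase:
    """Illustrative base class; the real Spicerack API may differ."""

    runner = None  # subclasses set this to their runner class

    def __init__(self, spicerack=None):
        self.spicerack = spicerack

    def get_runner(self, args):
        """Default implementation covering the 99% case."""
        if self.runner is None:
            raise NotImplementedError("set `runner` or override get_runner()")
        return self.runner(args, self.spicerack)


class UpdateWikireplicaViewsRunner:
    def __init__(self, args, spicerack):
        self.args = args
        self.spicerack = spicerack


class UpdateWikireplicaViews(CookbookBase):
    # No get_runner boilerplate needed anymore:
    runner = UpdateWikireplicaViewsRunner
```

With this in place, a cookbook that needs special construction logic can still override get_runner; everything else just sets the class attribute.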

Random notes

sudo lsof -Xd DEL - lists the files that have been deleted but are still held open by a running process


Get mysql hostname: show variables where Variable_name='hostname';

Why does sshing into mgmt not accept the password?

Because you forgot the `root@` part!

Instead of ssh dbstore1007.mgmt.e

do `ssh root@dbstore1007.mgmt.e`

Or make ssh use the root user in your ~/.ssh/config:
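A sketch of such a config entry (the host pattern is an assumption; adjust it to the actual mgmt domain names in use):

```
Host *.mgmt.eqiad.wmnet *.mgmt.codfw.wmnet
    User root
```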

refactor this to run automatically

Why no homer diff?


how to check what VLAN a host belongs to?


Proposal: stop using conda for infrastructure

Why not use standard pip?

How to apply hadoop config changes?

For example

linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200

a mystery

sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?

Gotta stop and start, the old reboot trick

sudo gnt-instance stop an-airflow1003.eqiad.wmnet

how to restart services on hadoop coordinator?


Want to restart services for an-test-coord1001 and an-coord*

But how to do this safely?

for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are
1) oozie
2) presto coordinator
3) hive server
4) hive metastore
and that's it IIRC
oozie can be restarted anytime, no issue on that front (all the state is on the db)
and we don't really have clients contacting it
the presto coordinator can be restarted anytime, it is quick but it may impact ongoing queries (if any, say from superset)
the hive server/coordinator is a bit more complicated
they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc..)
so the safe way is to temporarily stop timers on the launcher, wait for RUNNING jobs to be as few as possible, and then restart the server and metastore
we have the analytics-hive.eqiad.wmnet that can be used in theory, but when you failover from say an-coord1001 to 1002 the target service is only the hive server
not the metastore
ah wait I am saying something silly
so on an-coord1002 we have both server and metastore
basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastore uses a specific database (in this case, the one on an-coord1001)
this is to avoid having a split brain view, we cannot use the db replica on 1002 for the metastore since it doesn't update the master when changed
for hive, just change the DNS of analytics-hive.eqiad.wmnet
to 1002, then wait for the TTL to expire
and you can freely restart daemons on 1001
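The failover described above, as a command sketch. The DNS record name comes from the discussion; the systemd unit names and the mechanism for changing the DNS record are assumptions, so verify them before running anything:

```shell
# 1. Point analytics-hive.eqiad.wmnet at an-coord1002 (via the DNS repo).
# 2. Wait for the record's TTL to expire before touching daemons:
dig +noall +answer analytics-hive.eqiad.wmnet   # TTL is the second column
# 3. Once clients have moved over, restart the daemons on an-coord1001
#    (unit names assumed):
sudo systemctl restart hive-server2.service
sudo systemctl restart hive-metastore.service
```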

Set boot order to disk - "upstream is aware" - any issue to track?

Ganeti#Create a VM

Can we delete the hadoop-analytics grafana section now?


Puppet failure on deployment-logstash03.deployment-prep.eqiad.wmflabs

make it stop!!!

Still happening!!!!!!!

"those alerts go to all admins of the deployment-prep cloud vps project, production root@ has nothing to do with them"

what to do about this /mnt/hdfs issue

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL | wc
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
     81     649    8819

a lotta deleted files still open on an-test-coord1001...?

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
systemd       1       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
fuse_dfs    813       root DEL    REG  253,0          2107668 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107706 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
fuse_dfs    813       root DEL    REG  253,0          2107712 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
fuse_dfs    813       root DEL    REG  253,0          2107651 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107661 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107663 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107664 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107690 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/nashorn.jar
fuse_dfs    813       root DEL    REG  253,0          2107685 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/cldrdata.jar
fuse_dfs    813       root DEL    REG  253,0          2107692 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunjce_provider.jar
fuse_dfs    813       root DEL    REG  253,0          2107694 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/zipfs.jar
fuse_dfs    813       root DEL    REG  253,0          2107693 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunpkcs11.jar
fuse_dfs    813       root DEL    REG  253,0          2107689 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/localedata.jar
fuse_dfs    813       root DEL    REG  253,0          2107686 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/dnsns.jar
fuse_dfs    813       root DEL    REG  253,0          2107687 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/icedtea-sound.jar
fuse_dfs    813       root DEL    REG  253,0          2107711 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar
fuse_dfs    813       root DEL    REG  253,0          2107718 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar
fuse_dfs    813       root DEL    REG  253,0          2107688 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/jaccess.jar
fuse_dfs    813       root DEL    REG  253,0          2107671 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107652 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107670 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
fuse_dfs    813       root DEL    REG  253,0          2107674 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/
fuse_dfs    813       root DEL    REG  253,0          2107691 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunec.jar
fuse_dfs    813       root DEL    REG  253,0          1971364 /tmp/hsperfdata_root/813
systemd-l  1057       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
dbus-daem  1069 messagebus DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
lldpd      1070     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/
lldpd      1070     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/
lldpd      1077     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/
lldpd      1077     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/
mysqld     6126      mysql DEL    REG   0,17           277921 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277920 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277919 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277918 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277917 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277916 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277915 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277914 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277913 /[aio]
mysqld     6126      mysql DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
mysqld     6126      mysql DEL    REG   0,17           277912 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277911 /[aio]
systemd    9708       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
(sd-pam)   9709       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
systemd   10294      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
(sd-pam)  10296      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
rsyslogd  15874       root DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/
rsyslogd  15874       root DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/
(sd-pam)  29881      razzi DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
kafkatee  33887   kafkatee DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/
airflow   38223  analytics DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/
airflow   38223  analytics DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/

Ok, unmounting and remounting fixed the jars, but the warning is still there

sudo umount /mnt/hdfs
sudo mount -a

homework for you - think about a single cumin command to run the umount/mount commands above on the hosts mounting /mnt/hdfs

Is this going to involve querying for nodes with fuse?
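One possible shape for that command. The host-selection query here is a guess, not the real alias or role; check modules/profile/templates/cumin/aliases.yaml.erb for an alias that actually matches the hosts mounting /mnt/hdfs:

```shell
# Hypothetical selector; replace with the real alias/class for fuse_dfs hosts.
sudo cumin 'P{O:analytics_cluster::client}' 'umount /mnt/hdfs && mount -a'
```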

how to make ssh an-coord1001 work without .e?
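One way to do that is hostname canonicalization in ~/.ssh/config (these are standard OpenSSH options, but the domain list is an assumption), so bare names like an-coord1001 get the search domain appended:

```
CanonicalizeHostname yes
CanonicalDomains eqiad.wmnet codfw.wmnet
CanonicalizeMaxDots 0
```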


what is the deal with product data engineering? We have product analytics and we are data engineering


druid administration improvements

add prometheus restarts to cookbook

razzi: there is an extra caveat that I have never had the time to fix, namely the fact that after restarting the clusters a roll restart of the prometheus-druid-exporter services (one for each node) is needed
so the way that we collect metrics is that we force druid to push (via HTTP POST) metrics to a localhost daemon that exposes them as prometheus exporter
when a roll restart happens, the overlord and coordinator leaders will likely change
so the past leaders stop emitting metrics, and due to how prometheus works they keep pushing the last value of their metrics (and not zero or null etc..)
I believe that some fix to the exporter may resolve this
but it has been in my backlog for a long time :)

Add druid cluster docs to wikitech

razzi: analytics is the one used by turnilo superset etc..
public is the one that gets only one dataset loaded, the mw history snapshot, and it is called by the AQS api
(and it is also not in the analytics VLAN, and has a load balancer in front of it)
the cookbook distinguishes between the two, especially for the pool/depool actions
and yes we need to restart both :)
also zookeeper on both clusters, that can be done running the zookeeper cookbook (there are options for both druid clusters)

Stuff to look into

How to enable yubikey

Run the command here:

sudo modify-mfa --enable UID

Then go to and logout

Then when you log in, it'll prompt and your yubikey will blink; touch it and you're good!

More broccoli (yubikey ssh)

an-druid* versus druid*

Which one is "druid public"?

Look at modules/profile/templates/cumin/aliases.yaml.erb and your questions will be answered...

druid-analytics: P{O:druid::analytics::worker}

druid-public: P{O:druid::public::worker}

# Class: role::druid::public::worker
# Sets up the Druid public cluster for use with AQS and wikistats 2.0.

# Class: role::druid::analytics::worker
# Sets up the Druid analytics cluster for internal use.
# This cluster may contain data not suitable for
# use in public APIs.


staff meeting to watch

unfortunate to have it be on YouTube: the account switcher is disabled if not logged in properly, and there are distracting "related videos"...

Aug 02 00:15:14 an-launcher1002 java[14538]: unable to create directory '/nonexistent/.cache/dconf': Permission denied. dconf will not work properly.

stop this log spam!

email failing to send

razzi@puppetmaster1001:/srv/private$ ack bsitzmann
1115:        email                 

Should clean this up

release team does training

look into this

could use an article for analytics vlan


Check if a new kernel is installed

ls -l /

razzi@aqs1004:~$ uname -r
razzi@aqs1004:~$ uname -a
Linux aqs1004 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
razzi@aqs1004:~$ ls -l /

Search all gzipped logs

zgrep linux- /var/log/dpkg.log.*

show http request and response headers

http get -p Hh

show info about a systemctl service

razzi@labstore1006:~$ systemctl status analytics-dumps-fetch-geoeditors_dumps.service
● analytics-dumps-fetch-geoeditors_dumps.service - Copy geoeditors_dumps files from Hadoop HDFS.
   Loaded: loaded (/lib/systemd/system/analytics-dumps-fetch-geoeditors_dumps.service; static; vendor preset: enabled)
   Active: inactive (dead) since Thu 2021-08-19 17:53:32 UTC; 36s ago
  Process: 16187 ExecStart=/usr/local/bin/kerberos-run-command dumpsgen /usr/local/bin/rsync-analytics-geoeditors_dumps (code=exited, status=0/SUCCESS)
 Main PID: 16187 (code=exited, status=0/SUCCESS)

How to get docker to run without sudo

sudo adduser razzi docker

and log out

See something like:

How to ensure a user actually signed L3?

Go to