
Learning the Wikimedia stack!




How does refine use salts?

Is /system a default directory for hadoop, or can we remove it?

Is there a place that lists the vlans?

How to check vlan for a host?

Q: Is it expected that when reimaging a host, we see the old name when running homer?

[edit interfaces interface-range disabled]
-    member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+    member ge-1/0/13;
     member ge-1/0/43 { ... }
[edit interfaces]
+   ge-1/0/13 {
+       description "db1125 {#2221}";
+   }

^ this is while decommissioning db1125

A: No. I had skipped some Netbox steps; once I fixed them, the old name no longer showed up.

Q: How to submit a test job to the yarn queue to test if it is accepting jobs?

Q: What to do about this warning on analytics1068?

May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/ → /run/hadoop-yarn/; please update the unit file accordingly.

Q: In Server Lifecycle#Rename while reimaging, when should the homer patch be merged?

A: The homer patch is for the firewall and has nothing to do with the reimaging process. Merge it after the reimage is complete.

Q: In what order should puppet patches be created during the server lifecycle? Things that probably need to be avoided: having a site.pp entry for a node that is being decommissioned, and having a site.pp entry for a node that doesn't exist yet.


Script to show what tickets are currently in progress

Add homer-public to codesearch

Remove legacy analytics-hadoop from grafana

Random notes

sudo lsof -Xd DEL - lists the files that have been deleted but are still held open by a running process


Why does sshing into mgmt not accept the password?

Because you forgot the `root@` part!

Instead of ssh dbstore1007.mgmt.e

do `ssh root@dbstore1007.mgmt.e`

Or make ssh use the root user in your ~/.ssh/config:
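
A minimal sketch of such an entry, assuming management hostnames all contain `.mgmt.` (the `Host` pattern here is a guess, adjust it to the real naming scheme):

```text
# ~/.ssh/config
# Hypothetical pattern: matches any host typed with ".mgmt." in its name.
Host *.mgmt.*
    User root
```

With this in place, a plain `ssh <host>` to a matching name logs in as root without the `root@` prefix.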

refactor this to run automatically

Why no homer diff?


how to check what vlan a host belongs to?


Proposal: stop using conda for infrastructure

Why not use standard pip?

How to apply hadoop config changes?

For example

linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200

a mystery

sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?

Gotta stop and start, the old reboot trick

sudo gnt-instance stop an-airflow1003.eqiad.wmnet

sudo gnt-instance start an-airflow1003.eqiad.wmnet

how to restart services on hadoop coordinator?


Want to restart services for an-test-coord1001 and an-coord*

But how to do this safely?

for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are
1) oozie
2) presto coordinator
3) hive server
4) hive metastore
and that's it IIRC
oozie can be restarted anytime, no issue on that front (all the state is on the db)
and we don't really have clients contacting it
the presto coordinator can be restarted anytime, it is quick but it may impact ongoing queries (if any, say from superset)
the hive server/metastore are a bit more complicated
they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc.)
so the safe way is to temporarily stop timers on the launcher, wait for RUNNING jobs to be as few as possible and then restart server and metastore
we have the analytics-hive.eqiad.wmnet that can be used in theory, but when you failover from say an-coord1001 to 1002 the target service is only the hive server
not the metastore
ah wait I am saying something silly
so on an-coord1002 we have both server and metastore
basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastores use a specific database (in this case, the one on an-coord1001)
this is to avoid a split-brain view: we cannot use the db replica on 1002 for the metastore, since changes to it do not propagate back to the master
for hive, just change the DNS of analytics-hive.eqiad.wmnet
to 1002, then wait for the TTL to expire
and you can freely restart daemons on 1001
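
The "wait for the TTL to expire" step above could be scripted roughly as follows. This is only a sketch: it assumes analytics-hive.eqiad.wmnet is a CNAME (so `dig +short NAME CNAME` prints the target), and the function names `resolve_target` and `wait_for_failover` are made up for illustration. It only checks the resolver that the current host uses; clients with their own DNS cache may lag behind.

```shell
#!/bin/bash
# Print what NAME currently points to (assumes NAME is a CNAME record).
resolve_target() {
  dig +short "$1" CNAME | head -n 1
}

# Block until NAME resolves to EXPECTED, polling every INTERVAL seconds.
wait_for_failover() {
  local name="$1" expected="$2" interval="${3:-30}"
  local current
  while true; do
    current="$(resolve_target "$name")"
    current="${current%.}"   # dig prints a trailing dot; strip it
    if [ "$current" = "$expected" ]; then
      echo "$name now points to $expected"
      return 0
    fi
    sleep "$interval"
  done
}
```

After the DNS change is merged, something like `wait_for_failover analytics-hive.eqiad.wmnet an-coord1002.eqiad.wmnet` would return once the old record is no longer served, at which point the daemons on 1001 can be restarted.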

Set boot order to disk - "upstream is aware" - any issue to track?

Ganeti#Create a VM

Can we delete the hadoop-analytics grafana section now?