Difference between revisions of "User:Razzi"

From Wikitech-static
Revision as of 15:56, 12 July 2021


Learning the Wikimedia stack!

Documentation

Lists (https://gtdfh.liw.fi/quickie-overview/)

Questions

How does refine use salts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/679939

Is /system a default directory for hadoop, or can we remove it?

Is there a place that lists the vlans?

How to check vlan for a host?

Q: Is it expected that when reimaging a host, we see the old name when running homer?

[edit interfaces interface-range disabled]
-    member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+    member ge-1/0/13;
     member ge-1/0/43 { ... }
[edit interfaces]
+   ge-1/0/13 {
+       description "db1125 {#2221}";
+   }

^ this is while decommissioning db1125

A: No. I had skipped some Netbox steps; once I fixed them, this diff no longer showed up.

Q: How to submit a test job to the yarn queue to test if it is accepting jobs?

Q: What to do about this warning on analytics1068?

May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/yarn-yarn-nodemanager.pid → /run/hadoop-yarn/yarn-yarn-nodemanager.pid; please update the unit file accordingly.

Q: In Server Lifecycle#Rename while reimaging, when should the homer patch be merged?

A: The homer patch is for the firewall and has nothing to do with the reimaging process. Merge it after the reimage is complete.

Q: What is the order for creating puppet patches when it comes to server lifecycle? Some things that might need to be avoided: having site.pp for node that is being decommissioned, having site.pp for node that doesn't exist yet

Ideas

Script to show what tickets are currently in progress

Add homer-public to codesearch

Remove legacy analytics-hadoop from grafana

Random notes

sudo lsof -Xd DEL - lists the files that have been deleted but are still held open by a running process

Puppet

https://www.digitalocean.com/community/tutorials/getting-started-with-puppet-code-manifests-and-modules

Why does sshing into mgmt not accept the password?

Because you forgot the `root@` part!

Instead of ssh dbstore1007.mgmt.e

do `ssh root@dbstore1007.mgmt.e`

Or make ssh use the root user in your ~/.ssh/config: https://stackoverflow.com/questions/10197559/ssh-configuration-override-the-default-username

refactor this to run automatically

https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend

Why no homer diff?

TBD

how to check what vlan a host belongs to?

???

Proposal: stop using conda for infrastructure

Why not use standard pip?

How to apply hadoop config changes?

For example https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194/1/hieradata/common.yaml

linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200

a mystery

sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?

Gotta stop and start, the old reboot trick

sudo gnt-instance stop an-airflow1003.eqiad.wmnet

how to restart services on hadoop coordinator?

for https://phabricator.wikimedia.org/T283067

Want to restart services for an-test-coord1001 and an-coord*

But how to do this safely?

for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are
1) oozie
2) presto coordinator
3) hive server
4) hive metastore
and that's it IIRC
oozie can be restarted anytime, no issue on that front (all the state is on the db)
and we don't really have clients contacting it
the presto coordinator can be restarted anytime, it is quick but it may impact ongoing queries (if any, say from superset)
the hive server/metastore is a bit more complicated
they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc..)
so the safe way is to temporarily stop timers on the launcher, wait for RUNNING jobs to be as few as possible and then restart server and metastore
we have the analytics-hive.eqiad.wmnet that can be used in theory, but when you failover from say an-coord1001 to 1002 the target service is only the hive server
not the metastore
ah wait I am saying something silly
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator
so on an-coord1002 we have both server and metastore
basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastore uses a specific database (in this case, the one on an-coord1001)
this is to avoid having a split brain view; we cannot use the db replica on 1002 for the metastore since it doesn't update the master when changed
so
for hive, just change the DNS of analytics-hive.eqiad.wmnet
to 1002, then wait for the TTL to expire
and you can freely restart daemons on 1001
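
The restart advice above can be written down as a small checklist. This is only a sketch in Python, not an existing tool: the SERVICES table just restates the notes above, and the timestamp and TTL are hypothetical placeholders (look up the real TTL of analytics-hive.eqiad.wmnet before relying on it).

```python
from datetime import datetime, timedelta

# Restart impact of each service on an-coord1001, restating the notes above.
SERVICES = {
    "oozie": "restart anytime; state lives in the db",
    "presto coordinator": "restart anytime; may impact ongoing queries",
    "hive server": "fail over analytics-hive.eqiad.wmnet first",
    "hive metastore": "fail over analytics-hive.eqiad.wmnet first",
}


def earliest_safe_restart(dns_changed_at: datetime, ttl: timedelta) -> datetime:
    """Time after which clients should no longer hold the old DNS answer."""
    return dns_changed_at + ttl


def needs_failover(service: str) -> bool:
    """True for services whose restart should wait for the DNS failover."""
    return "fail over" in SERVICES[service]


# Hypothetical values: DNS changed at 10:00 with a 5 minute TTL.
changed = datetime(2021, 7, 12, 10, 0)
print(earliest_safe_restart(changed, timedelta(minutes=5)))
print([name for name in SERVICES if needs_failover(name)])
```

Keeping the impact next to each service name keeps the "mental list" in one place; the DNS wait is just the change time plus the TTL.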

Set boot order to disk - "upstream is aware" - any issue to track?

Ganeti#Create a VM

Can we delete the hadoop-analytics grafana section now?

https://grafana.wikimedia.org/d/000000258/analytics-hadoop?orgId=1

DONE

Puppet failure on deployment-logstash03.deployment-prep.eqiad.wmflabs

make it stop!!!

what to do about this /mnt/hdfs issue

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL | wc
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
     81     649    8819

a lotta deleted files still open on an-test-coord1001...?

razzi@an-test-coord1001:~$ sudo lsof -Xd DEL
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
COMMAND     PID       USER  FD   TYPE DEVICE SIZE/OFF    NODE NAME
systemd       1       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
fuse_dfs    813       root DEL    REG  253,0          2107668 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libsunec.so
fuse_dfs    813       root DEL    REG  253,0          2107706 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
fuse_dfs    813       root DEL    REG  253,0          2107712 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
fuse_dfs    813       root DEL    REG  253,0          2107651 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjaas_unix.so
fuse_dfs    813       root DEL    REG  253,0          2107661 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libmanagement.so
fuse_dfs    813       root DEL    REG  253,0          2107663 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnet.so
fuse_dfs    813       root DEL    REG  253,0          2107664 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libnio.so
fuse_dfs    813       root DEL    REG  253,0          2107690 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/nashorn.jar
fuse_dfs    813       root DEL    REG  253,0          2107685 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/cldrdata.jar
fuse_dfs    813       root DEL    REG  253,0          2107692 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunjce_provider.jar
fuse_dfs    813       root DEL    REG  253,0          2107694 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/zipfs.jar
fuse_dfs    813       root DEL    REG  253,0          2107693 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunpkcs11.jar
fuse_dfs    813       root DEL    REG  253,0          2107689 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/localedata.jar
fuse_dfs    813       root DEL    REG  253,0          2107686 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/dnsns.jar
fuse_dfs    813       root DEL    REG  253,0          2107687 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/icedtea-sound.jar
fuse_dfs    813       root DEL    REG  253,0          2107711 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jfr.jar
fuse_dfs    813       root DEL    REG  253,0          2107718 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar
fuse_dfs    813       root DEL    REG  253,0          2107688 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/jaccess.jar
fuse_dfs    813       root DEL    REG  253,0          2107671 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libzip.so
fuse_dfs    813       root DEL    REG  253,0          2107652 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjava.so
fuse_dfs    813       root DEL    REG  253,0          2107670 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libverify.so
fuse_dfs    813       root DEL    REG  253,0          2107674 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
fuse_dfs    813       root DEL    REG  253,0          2107691 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/ext/sunec.jar
fuse_dfs    813       root DEL    REG  253,0          1971364 /tmp/hsperfdata_root/813
systemd-l  1057       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
dbus-daem  1069 messagebus DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
lldpd      1070     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd      1070     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
lldpd      1077     _lldpd DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
lldpd      1077     _lldpd DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
mysqld     6126      mysql DEL    REG   0,17           277921 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277920 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277919 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277918 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277917 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277916 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277915 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277914 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277913 /[aio]
mysqld     6126      mysql DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
mysqld     6126      mysql DEL    REG   0,17           277912 /[aio]
mysqld     6126      mysql DEL    REG   0,17           277911 /[aio]
systemd    9708       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam)   9709       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
systemd   10294      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
(sd-pam)  10296      oozie DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
rsyslogd  15874       root DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
rsyslogd  15874       root DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
(sd-pam)  29881      razzi DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
kafkatee  33887   kafkatee DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
airflow   38223  analytics DEL    REG  253,0          1837030 /usr/lib/x86_64-linux-gnu/libhogweed.so.4.5
airflow   38223  analytics DEL    REG  253,0          1837035 /usr/lib/x86_64-linux-gnu/libnettle.so.6.5
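
A listing this long is easier to read when grouped by process. The following is a rough Python sketch, assuming the column layout shown above (DEL rows have an empty SIZE/OFF column); the function name is made up for illustration and the sample rows are copied from the output above.

```python
from collections import Counter


def count_deleted_by_process(lsof_output: str) -> Counter:
    """Count DEL entries per (command, pid) in `lsof -Xd DEL` output."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # DEL rows have an empty SIZE/OFF column, so they split into 8 fields:
        # COMMAND PID USER FD TYPE DEVICE NODE NAME
        if len(fields) < 8 or fields[3] != "DEL":
            continue  # skips the header and the lsof warning lines
        counts[(fields[0], fields[1])] += 1
    return counts


SAMPLE = """\
lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.
COMMAND     PID       USER  FD   TYPE DEVICE SIZE/OFF    NODE NAME
systemd       1       root DEL    REG  253,0          1837021 /usr/lib/x86_64-linux-gnu/liblz4.so.1.8.3
fuse_dfs    813       root DEL    REG  253,0          2107706 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar
fuse_dfs    813       root DEL    REG  253,0          2107712 /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jsse.jar
mysqld     6126      mysql DEL    REG   0,17           277921 /[aio]
"""

for (command, pid), n in count_deleted_by_process(SAMPLE).most_common():
    print(f"{command} (pid {pid}): {n}")
```

On the full listing above, most of the entries belong to fuse_dfs (pid 813), which is consistent with the /mnt/hdfs mount holding on to old JVM jars.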

OK, unmounting and remounting fixed the jars, but the warning is still there

sudo umount /mnt/hdfs
sudo mount -a

homework for you - think about a single cumin command to run the umount/mount commands above on the hosts mounting /mnt/hdfs

Is this going to involve querying for nodes with fuse?
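
One possible building block for that question is a check for whether a host has the fuse mount at all. This is a sketch only: it assumes the filesystem type is fuse.fuse_dfs (the name in the lsof warning above) and that the input is in /proc/mounts format; the function name and the sample text are made up for illustration. How this would plug into a cumin host selection is left open.

```python
def has_fuse_hdfs_mount(mounts_text: str, mount_point: str = "/mnt/hdfs") -> bool:
    """True if mounts_text (/proc/mounts format) lists a fuse_dfs mount at mount_point."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] == "fuse.fuse_dfs":
            return True
    return False


# Illustrative sample, not copied from a real host.
SAMPLE_MOUNTS = """\
/dev/mapper/vg0-root / ext4 rw,relatime 0 0
fuse_dfs /mnt/hdfs fuse.fuse_dfs rw,nosuid,nodev,relatime 0 0
"""

print(has_fuse_hdfs_mount(SAMPLE_MOUNTS))
```

On a real host the input would come from reading /proc/mounts.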

how to make ssh an-coord1001 work without .e?

tbd

what is the deal with product data engineering? We have product analytics and we are data engineering

??