You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
User:Razzi: Difference between revisions
Revision as of 16:53, 24 June 2021
Learning the Wikimedia stack!
Documentation
- 2021-04-20
- 2021-05-5
- 2021-06-09
- 2021-06-1
- 2021-06-10
- 2021-06-14
- 2021-06-30
- 2021-07-01
- 2021-07-30
- 2021-08-02
- 2021-09-09
- 2021-09-24
- 2022-02-07
- 2022-03-28
- A week with the search team
- Analytics notes
- Debugging eventlogging to druid network flows internal hourly.service
- Developing cookbook locally
- Experiment: use puppet notice to show variable
- First logical volume resizing
- First pass at understanding T300164 varnishkafka alerts
- Ganeti error: Connection to console of instance datahubsearch1002.eqiad.wmnet failed
- How to depool / pool a host from etcd
- How to run systemd unit of another user
- How to show mysql host from sql query
- How to view pooled services for lvs
- Installing puppet on mac
- Learning about partitions for flerovium/furud
- Looking into The following units failed: wmf auto restart prometheus-mysqld-exporter@matomo.service
- NameNode vs DataNode
- Notes on clouddb views
- Plan to drain hadoop cluster
- Presto query logging: https://phabricator.wikimedia.org/T269832
- Puppetboard
- Set up haproxy on mediawiki-vagrant
- Setting up kerberos locally
- Spicerack python api repl
- Superset 1.3.1 upgrade recap
- T279304
- T280132 disk swap
- Triage Superset Dashboard Timeouts - T294768
- What is conftool
- alertname: Icinga/Check correctness of the icinga configuration
- an-master reimaging
- code search
- common.js
- deployment train 5-18
- firewall audit
- fm/CFSSL
- fm/SCSI
- grand SRE IC plan
- https://phabricator.wikimedia.org/T298505
- learning storage on vagrant
- logs
- new plan for reimaging an-masters
- rebase off of origin in one command
- reimage of db1125
- scratch
- snippets
- ssh config
- ssh single letter domain shortcut
- superset 1.3.1 errors
- vector.css
- working with apache atlas in docker
Lists (https://gtdfh.liw.fi/quickie-overview/)
Questions
How does refine use salts? https://gerrit.wikimedia.org/r/c/operations/puppet/+/679939
Is /system a default directory for hadoop, or can we remove it?
Is there a place that lists the vlans?
How to check vlan for a host?
Q: Is it expected that when reimaging a host, we see the old name when running homer?
[edit interfaces interface-range disabled]
-   member ge-1/0/13;
[edit interfaces interface-range vlan-analytics1-d-eqiad]
+   member ge-1/0/13;
    member ge-1/0/43 { ... }
[edit interfaces]
+   ge-1/0/13 {
+       description "db1125 {#2221}";
+   }
^ this is while decommissioning db1125
A: No, I skipped some netbox steps; when I fixed them this didn't show up
Q: How to submit a test job to the yarn queue to test if it is accepting jobs?
Q: What to do about this warning on analytics1068?
May 06 21:03:35 analytics1068 systemd[1]: /run/systemd/generator.late/hadoop-yarn-nodemanager.service:18: PIDFile= references path below legacy directory /var/run/, updating /var/run/hadoop-yarn/yarn-yarn-nodemanager.pid → /run/hadoop-yarn/yarn-yarn-nodemanager.pid; please update the unit file accordingly.
Q: In Server Lifecycle#Rename while reimaging, when should the homer patch be merged?
A: The homer patch is for the firewall and has nothing to do with the reimaging process. Merge it after the reimage is complete.
Q: What is the order for creating puppet patches when it comes to server lifecycle? Some things that might need to be avoided: having site.pp for node that is being decommissioned, having site.pp for node that doesn't exist yet
Ideas
Script to show what tickets are currently in progress
Add homer-public to codesearch
Remove legacy analytics-hadoop from grafana
Random notes
sudo lsof -Xd DEL
- lists the files that have been deleted but are still held open by a running process
Puppet
Why does sshing into mgmt not accept the password?
Because you forgot the `root@` part!
Instead of ssh dbstore1007.mgmt.e
do `ssh root@dbstore1007.mgmt.e`
Or make ssh use the root user in your ~/.ssh/config: https://stackoverflow.com/questions/10197559/ssh-configuration-override-the-default-username
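The ~/.ssh/config approach might look like the sketch below. The host pattern `*.mgmt.*` is an assumption based on the dbstore1007.mgmt hostname above, and `mgmt_ssh_stanza` is a hypothetical helper name; adjust the pattern to however you type management hostnames.

```shell
# Print an ssh_config stanza that makes ssh log in as root on management interfaces.
# Assumption: management hostnames contain ".mgmt." as in the dbstore1007.mgmt example.
mgmt_ssh_stanza() {
  printf 'Host *.mgmt.*\n    User root\n'
}

# Review the output first, then append it by hand:
#   mgmt_ssh_stanza >> ~/.ssh/config
```

With a stanza like this in place, the plain ssh command without `root@` should pick up the root user for matching hosts.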
refactor this to run automatically
Why no homer diff?
TBD
how to check what vlan a host belongs to?
???
Proposal: stop using conda for infrastructure
Why not use standard pip?
How to apply hadoop config changes?
For example https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194/1/hieradata/common.yaml
linux-host-entries.ttyS0-115200 versus linux-host-entries.ttyS1-115200
a mystery
sudo gnt-instance console an-airflow1002.eqiad.wmnet is stuck, is this normal?
Gotta stop and start, the old reboot trick
sudo gnt-instance stop an-airflow1003.eqiad.wmnet
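Only the stop half is written down above; the matching start presumably follows it. A sketch of the pair, assuming it runs on the Ganeti master with the same sudo access. `restart_instance` and the `DRY_RUN` variable are hypothetical names introduced here, not part of Ganeti.

```shell
# Stop and then start a Ganeti instance (the "old reboot trick").
# With DRY_RUN=1 the commands are only printed, not executed.
restart_instance() {
  local instance="$1"
  local cmd
  for cmd in stop start; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "sudo gnt-instance $cmd $instance"
    else
      # Give up if the stop fails, so we never try to start a half-stopped instance.
      sudo gnt-instance "$cmd" "$instance" || return 1
    fi
  done
}

# Example:
#   DRY_RUN=1 restart_instance an-airflow1003.eqiad.wmnet
```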
how to restart services on hadoop coordinator?
for https://phabricator.wikimedia.org/T283067
Want to restart services for an-test-coord1001 and an-coord*
But how to do this safely?
for all things that you need to restart, it is good to make a mental list of services to restart and what impact they have
on an-coord1001 there are
1) oozie
2) presto coordinator
3) hive server
4) hive metastore
and that's it IIRC
oozie can be restarted anytime, no issue on that front (all the state is on the db)
and we don't really have clients contacting it
the presto coordinator can be restarted anytime, it is quick but it may impact ongoing queries (if any, say from superset)
the hive server/coordinator is a bit more complicated
they are quick to restart, but any client that is using them can be impacted (all oozie jobs, timers, etc..)
so the safe way is to temporary stop timers on launcher, wait for RUNNING jobs to be as few as possible and then restart server and metastore
we have the analytics-hive.eqiad.wmnet that can be used in theory, but when you failover from say an-coord1001 to 1002 the target service is only the hive server
not the metastore
ah wait I am saying something silly
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator
so on an-coord1002 we have both server and metastore
basically a mirror of 1001
what I was misremembering is that the servers use the "local" metastore, but the metastore use a specific database (in this case, the one on an-coord1001)
this is to avoid having a split brain view, we cannot use the db replica on 1002 for the metastore since it doesn't update the master when changed
so
for hive, just change the DNS of analytics-hive.eqiad.wmnet
to 1002, then wait for the TTL to expire
and you can freely restart daemons on 1001
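To judge how long to wait after the DNS change, the TTL can be read from a `dig` answer. A small sketch: `answer_ttl` is a hypothetical helper that pulls the TTL column (the second field) out of `dig +noall +answer` output. Which resolver to query and how the record is defined are not covered here.

```shell
# Print the TTL (in seconds) of the first record in `dig +noall +answer` output read from stdin.
answer_ttl() {
  # Answer lines look like: NAME TTL CLASS TYPE DATA
  awk 'NF >= 5 { print $2; exit }'
}

# Usage (run from a host where the name resolves):
#   dig +noall +answer analytics-hive.eqiad.wmnet | answer_ttl
```

When the query goes to a caching resolver, dig reports the remaining TTL, so the number counts down between runs.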
Set boot order to disk - "upstream is aware" - any issue to track?
Can we delete the hadoop-analytics grafana section now?
https://grafana.wikimedia.org/d/000000258/analytics-hadoop?orgId=1
DONE