
User:Razzi/2021-06-10

Revision as of 21:22, 10 June 2021 by imported>Razzi

Going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194

sudo cookbook sre.hadoop.roll-restart-masters analytics

OK, I somehow got an EOFError at the "Ok to proceed?" prompt:

razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics
START - Cookbook sre.hadoop.roll-restart-masters
Checking HDFS and Yarn daemon status. We expect active statuses on the Master node, and standby statuses on the other. Please do not proceed otherwise.
Checking Master/Standby status.

Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.68s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.54s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.64s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.74s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Please make sure that the active/standby nodes are correct.
Type "go" to proceed or "abort" to interrupt the execution
> go
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-master[1001-1002].eqiad.wmnet
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00,  2.32hosts/s]
FAIL |                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00,  2.82hosts/s]
FAIL |                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting Yarn Resourcemanager on Master.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.71s/hosts]
FAIL |                                |   0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 60.0 seconds.
Restarting Yarn Resourcemanager on Standby.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.69s/hosts]
FAIL |                                |   0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Checking Master/Standby status.

Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.75s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.80s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed with HDFS Namenodes ?
Type "go" to proceed or "abort" to interrupt the execution
> go
Run manual HDFS failover from master to standby.
Run manual HDFS Namenode failover from an-master1001-eqiad-wmnet to an-master1002-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Failover to NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 successful
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:17<00:00, 17.95s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:17<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 30 seconds.
Restart HDFS Namenode on the master.
----- OUTPUT of 'systemctl restart hadoop-hdfs-zkfc' -----
----- OUTPUT of 'systemctl restar...op-hdfs-namenode' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:29<00:00, 29.38s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:29<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 600.0 seconds.
^@Checking Master/Standby status.

Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.68s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.65s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Exception raised while executing cookbook sre.hadoop.roll-restart-masters:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 18, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hadoop/roll-restart-masters.py", line 154, in run
    ask_confirmation("Ok to proceed?")
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 67, in ask_confirmation
    ['go', 'abort'])
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 45, in ask_input
    response = input('> ')
EOFError
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
razzi@cumin1001:~$

OK, I ran the rest of the commands manually.
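
Side note on the traceback: Python's input() raises EOFError when it hits end-of-file on stdin without reading any data (for example Ctrl-D at an empty prompt, or a closed terminal). What triggered it in this session is not known. Below is a minimal sketch, not the wmflib implementation, of a confirmation prompt that turns EOF into an explicit abort instead of an unhandled traceback. The names ask_choice, ConfirmationAborted and the read parameter are hypothetical.

```python
from typing import Callable, Sequence


class ConfirmationAborted(Exception):
    """Hypothetical exception: stdin was closed before a valid answer arrived."""


def ask_choice(
    prompt: str,
    choices: Sequence[str],
    read: Callable[[str], str] = input,
) -> str:
    """Ask until the operator types one of `choices`.

    `read` defaults to the built-in input(); it is a parameter only so the
    function can be exercised without a real terminal.
    """
    print(">>> " + prompt)
    print("Type " + " or ".join('"{}"'.format(c) for c in choices))
    while True:
        try:
            response = read("> ")
        except EOFError:
            # input() raises EOFError on end-of-file; report it as an abort
            # rather than letting the raw traceback reach the operator.
            raise ConfirmationAborted(
                "stdin closed while waiting for: " + prompt
            ) from None
        if response in choices:
            return response
        print("Invalid response, please type one of: " + ", ".join(choices))
```

A caller could catch ConfirmationAborted and print which steps are still pending, so that it is clearer what has to be finished by hand.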

Then I saw a new error on alerts.wikimedia.org:

CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics

So I pulled up journalctl -u monitor_refine_eventlogging_analytics on an-launcher1002:

Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 WARN RefineMonitor: RefineMonitor found problems for path /wmf/data/raw/eventlogging -> database event (/wmf/data/event):
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: The following dataset targets in path /wmf/data/raw/eventlogging between 2021-06-08T00:15:07.000Z and 2021-06-09T20:15:07.001Z either have failed or still need
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: Targets with failures:
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=16
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=17
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=18
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=19
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 INFO RefineMonitor: Sending problem email report to analytics-alerts@wikimedia.org
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Failed with result 'exit-code'.

It turns out the service just needed to be restarted; the dconf "Permission denied" messages appear to be unrelated.
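
To see at a glance which partitions RefineMonitor complained about, the "Targets with failures" lines can be pulled out of the journal text. This is a rough sketch that assumes the line layout shown in the journalctl output above (backtick-quoted database and table, followed by a path ending in year=/month=/day=/hour=). The function name failed_targets and the SAMPLE constant are hypothetical.

```python
import re
from typing import List, Tuple

# Matches the `db`.`table` /path part of a "Targets with failures" line.
TARGET_RE = re.compile(r"`(?P<db>[^`]+)`\.`(?P<table>[^`]+)`\s+(?P<path>/\S+)")
# Matches the Hive-style hourly partition suffix of the path.
PARTITION_RE = re.compile(r"year=(\d+)/month=(\d+)/day=(\d+)/hour=(\d+)")

# Two lines copied from the journal output, used as sample input.
SAMPLE = """\
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
"""


def failed_targets(journal_text: str) -> List[Tuple[str, str]]:
    """Return (db.table, 'YYYY-MM-DDTHH') for every target line found."""
    results = []
    for line in journal_text.splitlines():
        target = TARGET_RE.search(line)
        if target is None:
            continue  # not a target line (e.g. the dconf warnings)
        partition = PARTITION_RE.search(target.group("path"))
        if partition is None:
            continue  # path without an hourly partition; skip it
        year, month, day, hour = (int(part) for part in partition.groups())
        name = "{}.{}".format(target.group("db"), target.group("table"))
        stamp = "{:04d}-{:02d}-{:02d}T{:02d}".format(year, month, day, hour)
        results.append((name, stamp))
    return results


if __name__ == "__main__":
    for name, stamp in failed_targets(SAMPLE):
        print(name, stamp)
```

Lines that do not match, such as the dconf warnings or the systemd exit messages, are skipped.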