You are browsing a read-only backup copy of Wikitech. The live site can be found at

Data Engineering/Ops week: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
Line 181: Line 181:

Example: sqoop jobs, reportupdater jobs, and everything not yet migrated to Airflow
Example: sqoop jobs, reportupdater jobs, and everything not yet migrated to Airflow
====Sqoop failures====
Sqoop is run on systemd timers right now (may be Airflow soon or we may migrate it to spark).  If it fails, look at [[Analytics/Systems/Cluster/Edit_history_administration]].

====Data loss alarms====
====Data loss alarms====

Revision as of 14:39, 5 August 2022

We rotate ops duty in pairs. On any given week, a member of the team is the primary on-duty ops person. This entails looking after the list of tasks detailed here. At the end of the week, the primary person stays on as a secondary and another member becomes primary. This page can be a checklist of things to do during ops week. Please keep it concise and actionable, and for longer explanations open a new page and link to it from here. The primary person is expected to monitor alarms and begin reacting to problems. The secondary person is there to pair, help with tasks that take more time, and importantly to decide what problems should be addressed immediately and what could be made into a task that gets prioritized as normal. This last duty will be somewhat temporary until the team has established formal SLOs and error budgets.

If it's your ops week, there will be a one hour timebox in your Google Calendar to fulfill your duties, which by default is set to 3pm UTC every weekday. Obviously not everyone's schedule will be able to accommodate this time, but make sure that you have time during your ops week to dedicate 100% to ops if needed. This means that if you are working on a time critical task or you are taking vacation you might need to swap ops weeks with someone else.

Remember you're a first responder. This means you don't have to fix whatever is broken during your ops week, but you have to alert the team, check the status of failed stuff, and report what's happened, alerts need to be acknowledged promptly so other team members know they are being taken care of.

Ops week is an opportunity to learn, do not be shy in asking for help from others.

Hand off

Ops week ends and begins on the Thursday of every week. We have a task grooming session scheduled for every Thursday. The beginning of this meeting is used to quickly groom tasks in the 'Ops Week' column on the Analytics Phabricator Board. The Ops Week column should then only contain actionable tasks for those on Ops Week duty for the next week.


Any decisions taken with a subset of team that all team members should know about need to be communicated to the rest of team via e-mail to analytics-alerts@. Example: deployments happening not happening, issues we have decided not to fix and schedule through regular kanban work. New, ahem, "fun discoveries" that you think might help the next opsen.

Working in off hours

The developer ops rotation does not include working off hours as its focus is on"routine" ops. The bulk of analytics infra is tier-2 and jobs that fail, for example, can be restarted the next day. If there are system issues (like The namenode going down) there should be ICINGA alarms fired that will go to the SRE team.

The SRE team needs to makes sure that at all times there is 1 person available for ops, which, in practice means taking turns to take vacations.

Things to do at some point during the week

Ops week tasks

Check the ops week column in the Data Engineering phabricator board to check if the SRE people have left anything for you to do.

Have any users left the Foundation?

We should check for users who are no longer employees of the foundation, and offboard them from the various places in our infrastructure where user content is stored. Ideally a Phabricator task should be open notifying the user and/or their team manager so that they have a chance to respond and move things. Generally, as part of the SRE offboarding script we will be notified when a user leaves and the task will be added to the on call column. We should remove the following:

  • Home directories in the stat machines.
  • Home directory in HDFS.
  • User hive tables (check if there is a hive database named after the user).
  • Ownership of hive tables needed by others (find tables with scripts below and ask what the new user should be, preferably a system account like analytics or analytics-search).

Use the check-user-leftovers script from your local shell (it uses your ssh keys), and copy the output into the Phabricator task. The tricky part is dropping data from the Hive warehouse. Before starting, let's remember that:

  • under /user/hive/warehouse we can find directories related to databases (usually suffixed with .db) that in turn contain directories for tables.
  • If the table is internal, then the data files will be found in a subdir of the HDFS Hive warehouse directory.
  • If the table is external, then the data files could potentially be anywhere in HDFS.
  • A Hive drop database command deletes metadata in the Hive Metastore, the actual HDFS files needs to be cleaned up manually.

The following command is useful to understand where is the location of the HDFS files of the tables belonging to a Hive database:

for t in $(hive -S -e "show tables from $DATABASE_TO_CHECK;" 2>/dev/null | grep -v tab_name); do echo "checking table: $DATABASE_TO_CHECK.$t"; hive -S -e "describe extended $DATABASE_TO_CHECK.$t" 2>/dev/null | egrep -o 'location:hdfs://[0-9a-z/\_-]+'; done

Full removal of files and Hive databases and tables

Once the list of things to remove has been reviewed, and it has been determined it is okay to remove, the following commands will help you do so:

Drop Hive database. From an-launcher1002 (or somewhere with the hdfs user account):

sudo -u hdfs kerberos-run-command hdfs hive
> DROP DATABASE <user_database_name_here> CASCADE;

Remove user's Hive warehouse database directory. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/hive/warehouse/<user_database_name_here>.db

Remove HDFS homedir. From an-launcher1002:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/<user_name>

Remove homedirs from regular filesystem on all nodes. From cumin1001:

sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/<user_name>'

Archival of user files

If necessary, we archive user files in HDFS in /wmf/data/archive/user/<user_name>. Make a directory with the user's name and copy any files there as needed.

Things to do every Tuesday

The Data Engineering deployment train 🚂

After the Data Engineering standup, around 5-6pm UTC, we deploy anything that has changes pending. Take a look at the Ready to Deploy column in the Kanban to identify what needs to be released, and notify the team of each deployment (using !log on IRC).

Deployments normally include restarting of jobs and updating jar versions, while some of us have super powers and can keep all that their head others need the etherpad to write down the deployment plan, please use it and document the deployment of the week here:

Here are instructions on each project's deployment:

Deploy Eventlogging

Deploy Wikistats UI

Deploy AQS

Deploy refinery source

Deploy refinery

Deploy Event Streams (more involved, ask Otto, but this is part of it)

Things to check every day

(We're centralizing all scheduled manual work here both as a reminder and as a TODO list for automation)

Is there a new Mediawiki History snapshot ready? (beginning of the month)

We need to tell AQS manually that a new snapshot is available so that the service reads from the latest data. This will happen around the middle of the month. Follow these instructions to do so.

The mediaiwki history process is documented here: Analytics/Systems/Data Lake/Administration/Edit/Pipeline Notice there are 3 sqoop jobs, one for mediawiki-history from labs, one for cu_changes (geoeditors), and one for sqooping tables like actor and comment from the production replicas.

Have new wikis been created?

This one will be easy to spot. If a new wiki receives pageviews and it's not on our allowlist, the team will start receiving alerts about pageviews from an unregistered Wikimedia site. The new wikis must be added in two places: the pageview allowlist and the sqoop groups list.

Adding new wikis to the sqoop list

First check that the wiki has already replicas...

  • In production DB:
# connect to the analytics production replica for the project to check
analytics-mysql azywiki
# list tables available for that project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;
  • In labs DB:
# connect to the analytics labs replica for the project to check (needs the password for user s53272)
# you need to know which port to use, so first you check for that in production DB:
analytics-mysql --print-taget azywiki

# Keep only the port (3315 in that example) and use if in connecting to cloudDB:
mysql --database azywiki_p -h clouddb1021.eqiad.wmnet -P 3315 -u s53272 -p
# list available tables for the project - An empty list means the project is not yet synchronized and shouldn't be added to sqoop list
show tables;

For wikis to be included in the next mediawiki history snapshot, they need to be added to the labs sqoop list. Add them to the last group (the one that contains the smallest wikis) here. Once the cluster is deployed (see section above) the wikis will be included in the next snapshot.

Adding new wikis to the pageview allow list

First list in hive all the current pageview allow list exceptions:

select * from wmf.pageview_unexpected_values where year=... and month=...;

That will tell you what you need to add to the list. Take a look at it, make sure it makes sense, and make a patch for it (example in gerrit). To stop the oozie alarms faster, you can merge your patch and sync the new file up to hdfsː

scp static_data/pageview/whitelist/whitelist.tsv an-launcher1002.eqiad.wmnet:
ssh an-launcher1002.eqiad.wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -put -f whitelist.tsv /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod +r /wmf/refinery/current/static_data/pageview/whitelist/whitelist.tsv

Have any alarms been triggered?


Airflow jobs run on one of our Airflow instances. If you receive an email like <Taskinstance: ... [failed]> or SLA miss on DAG, the best way to investigate is the Airflow UI (see Airflow Instance-specific instructions). Once in the Airflow UI, you can check logs and retry or pause the DAG accordingly. If the error was transient or you can deploy a simple fix, you can retry by clearing the failed task instances. Otherwise, pausing the DAG will stop further alert emails. Airflow documentation is excellent and worth learning, you can find links in the UI header.

Canary alarms

The produce_canary_events job will fail if any event fails to be produced. EventGate returns the reason for failure in its response. This will be logged in the job's log file in /var/log/refinery/produce_canary_events/produce_canary_events.log on an-launcher1002 (as of 2021-01).

If an event is invalid, you'll see an error message like: Jan 12 19:15:04 an-launcher1002 produce_canary_events[15437]: HttpResult(failure) status: 207 message: Multi-Status. body: {"invalid":[{"status":"invalid","event":{"event":{"is_mobile":true,"user_editcount":123,"user_id":456,"impact_module_state":"activated","start_email_state":"noemail","homepage_pageview_token":"example token"},"meta":{"id":"b0caf18d-6c7f-4403-947d-2712bbe28610","stream":"eventlogging_HomepageVisit","domain":"canary","dt":"2021-01-12T19:15:04.339Z","request_id":"54df8880-61ce-4cf9-86fa-342c917ea622"},"dt":"2020-04-02T19:11:20.942Z","client_dt":"2020-04-02T19:11:20.942Z","$schema":"/analytics/legacy/homepagevisit/1.0.0","schema":"HomepageVisit","http":{"request_headers":{"user-agent":"Apache-HttpClient/4.5.12 (Java/1.8.0_272)"}}},"context":{"errors":[{"keyword":"required","dataPath":".event","schemaPath":"#/properties/event/required","params":{"missingProperty":"start_tutorial_state"},"message":"should have required property 'start_tutorial_state'"}],"errorsText":"'.event' should have required property 'start_tutorial_state'"}}],"error":[]}

The canary event is constructed from the schema's examples. In this error message, the schema at /analytics/legacy/homepagevisit/1.0.0 examples was missing a required field, and the event failed validation.

The fix will be dependent on the problem. In this specific case, we modified what should have been an immutable schema after eventgate had cached the schema, so eventgate needed a restart to flush caches.

Ask ottomata for help with this if you encounter issues.

Anomaly detection alarms

There are three different types of alarms: outage/censorship, general oddities on refine system measured by calculating entropy of user agent and mobile pageview alarms.

Superset dashboard:

  • Outage/censorship alarms. These alarms measure changes in traffic per city. When they raise look whether the overall volume of pageviews is OK, if it is issue might be due to an outage on a particular country. Traffic will troubleshoot further
  • Eventlogging refine alarms on navigationtiming data. Variations here might indicate a problem in the refine pipeline (like all user agents now being null) or rather an update of UA parser.
  • Mobile pageview alarms. An alarm might indicate a drop on mobile app or mobile web pageviews, do check numbers in turnilo. These alarms are not setup for desktop pageviews as nature of timeseries is quite different.

A through description of the system can be found here

Failed oozie jobs

Check if any work is already being done in IRC and the analytics SAL. Follow the steps in Analytics/Systems/Cluster/Oozie/Administration to restart a job.

Log any job restart: If you're going through oozie alert emails or something similar, double check the SAL to see whether work has already been done to fix the problem.

Failed systemd/journald based jobs

Follow the page on managing systemd timers and see what you can do. Notify an ops engineer if you don't know what you're doing.

Example: sqoop jobs, reportupdater jobs, and everything not yet migrated to Airflow

Sqoop failures

Sqoop is run on systemd timers right now (may be Airflow soon or we may migrate it to spark). If it fails, look at Analytics/Systems/Cluster/Edit_history_administration.

Data loss alarms

Follow the steps in Analytics/Systems/Dealing with data loss alarms

Mediawiki Denormalize checker alarms

If a mediawiki snapshot fails its Checker step you will get an alarm via e-mail, this is what to do: Analytics/Systems/Cluster/Edit_history_administration#QA:_Assessing_quality_of_a_snapshot

Reportupdater failures

Check the time that the alarm was triggered at and look for causes of the problems in the logs at

sudo journalctl -u reportupdater-interlanguage

(Replace interlanguage with the failed unit)

Druid indexation job fails

Take a look at the Admin console and the indexation logs. The logs show the detailed errors, but the console can be easier to look at and spot obvious problems.

Deletion script alarms

When data directories and hive partitions are out of sync the deletion scripts can fail. To resync directories and partitions execute msck repair table <table_name>; in Hive.

Refine failure report

An error like Refine failure report for refine_eventlogging_analytics will usually have some actual exception information in the email itself. Take a look and most likely rerun the hour(s) that failed: Analytics/Systems/Refine#Rerunning_jobs

Refine's failure report email should contain the flags you need to rerun whatever might have failed. The name of the script to run is in the subject line. E.g. if the subject line says for 'refine_eventlogging_analytics', that is the command you will need to run. You can copy and paste the flags from the failure email to this command, like:

sudo -u analytics kerberos-run-command analytics refine_eventlogging_analytics --ignore_failure_flag=true --table_include_regex='contenttranslation' --since='2021-11-29T22:00:00.000Z' --until='2021-11-30T23:00:00.000Z'

Wait for the job to finish, and then check the logs to see that all went well.

sudo -u analytics yarn logs -applicationId <app_id_here> | grep Refine:
21/12/02 19:30:53 INFO Refine: Finished refinement of dataset hdfs://analytics-hadoop/wmf/data/raw/eventlogging_legacy/eventlogging_ContentTranslation/year=2021/month=11/day=30/hour=21 -> `event`.`contenttranslation` /wmf/data/event/contenttranslation/year=2021/month=11/day=30/hour=21. (# refined records: 66)

For errors like RefineMonitor problem report for job monitor_refine_sanitize_eventlogging_analytics, visit Backfilling sanitization.

Debug Java applications in trouble

When a java daemon misbehaves (like in it is absolutely vital to get some info from it before a restart, otherwise it will be difficult to report a problem upstream. The jstack utility seems to be the best candidate for this job, and plenty of guides can be found on the internet. For a quick copy/paste tutorial:

  • use ps -auxff | grep $something to correctly identify the PID of the process
  • then run sudo -u $user jstack -l $PID > /tmp/thread_dump

The $user referenced in the last command is the user running the Java daemon. You should have the rights to sudo as that user, but if not pleas ping Luca or Andrew.

This guide is concise and neat for a more verbose explanation:

Useful administration resources

Useful Kerberos commands

Since Kerberos was introduced, proper authentication has been enforced for all access to HDFS. Sometimes Analytics admins need to access resources of another user, for example to debug or fix something, so more powerful credentials are needed. In order to be like root on HDFS, you need to ssh to any of an-master1001, an-master1002 or an-coord1001 and run:

sudo -u hdfs kerberos-run-command hdfs command-that-you-want-to-execute

Special example to debug Yarn application logs:

sudo -u hdfs kerberos-run-command hdfs yarn logs -applicationId application_1576512674871_263229 -appOwner elukey

In the above case, Analytics admins will be able to pull logs for user elukey and app-id application_1576512674871_263229 without incurring in permission errors.

Restart daemons

On most of the Analytics infrastructure the following commands are available for Analytics admins:

sudo systemctl restart name-of-the-daemon
sudo systemctl start name-of-the-daemon
sudo systemctl stop name-of-the-daemon
sudo systemctl reset-failed name-of-the-daemon
sudo systemctl status name-of-the-daemon
sudo journalctl -u name-of-the-daemon

For example, let's say a restart for the Hive server is needed and no SREs are around:

sudo systemctl restart hive-server2