Discovery/Analytics

Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository. There is an additional repository at search/airflow which deploys the airflow service that much of the analytics code depends on for scheduling.

search/airflow

The search/airflow repository deploys all necessary code to run the airflow service. This includes dependencies needed by upstream airflow, along with dependencies for our own airflow plugins.

Updating dependencies

Update the requirements.txt file in the base of the repository to include any new or updated dependencies. The related requirements-frozen.txt file is maintained by the make_wheels.sh script. To avoid unrelated updates when building wheels, this process first installs requirements-frozen.txt, then installs requirements.txt on top of that. Because of this, adding or updating a package only requires adjusting requirements.txt, but removing a package requires removing it from requirements-frozen.txt as well.
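
For example (the package names below are hypothetical; this only sketches the workflow described above):

 # Add or upgrade a dependency: edit requirements.txt only, then rebuild the wheels.
 echo 'some-new-package>=1.2' >> requirements.txt
 # Remove a dependency: drop it from both files, otherwise the frozen file
 # will pull it straight back in on the next make_wheels.sh run.
 sed -i '/^some-old-package/d' requirements.txt requirements-frozen.txt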

  1. Build the docker environment for make_wheels. Docker is used to ensure this is compatible with debian 10.2 which runs on an-airflow1001:
    docker build -t search-airflow-make-wheels:latest .
  2. Run make_wheels:
    docker run --rm -v $PWD:/src:rw --user $(id -u):$(id -g) search-airflow-make-wheels:latest
  3. Upload wheels to archiva:
    bin/upload_wheels.py artifacts/*.whl

Deploy

From deployment.eqiad.wmnet:

  1. cd /srv/deployment/search/airflow
    git fetch origin
    git log HEAD..origin/master
    git pull --ff-only origin master
    scap deploy 'a meaningful description'

Post-Deploy

Scap does not auto-restart the related airflow services (but it should!). After deployment, you will need to log in to an-airflow1001 and restart the airflow-scheduler and airflow-webserver services.
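
A minimal sketch of that restart, assuming you have the necessary sudo rights on an-airflow1001:

 ssh an-airflow1001.eqiad.wmnet
 sudo systemctl restart airflow-scheduler airflow-webserver
 # Confirm both units came back up cleanly:
 systemctl status airflow-scheduler airflow-webserver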

Debugging

  • Individual job logs are available from the airflow web ui
  • Typically those logs are only for a bit of control code, and report a yarn application id for the real work.
  • `sudo -u analytics-search kerberos-run-command analytics-search yarn logs -applicationId <app_id_from_logs> | less`
    • always redirect the output! These can be excessively verbose
  • Airflow jobs that previously finished can be re-triggered from the UI if necessary. From the dag tree view select the earliest task in a dagrun that should be cleared, and from the modal that pops up select clear (with defaults of downstream recursive). This will put all downstream tasks back to a clean state and they will be re-run as before.

wikimedia/discovery/analytics

The wikimedia/discovery/analytics repository contains airflow DAGs, various python scripts invoked by those DAGs, and custom python environments that airflow will use to run the scripts in.

How to prepare a commit

Airflow fixtures

For many of the tasks run by airflow, the test suite executes them and records the commands that would have been run as test fixtures. This helps keep track of what exactly is changing when we deploy, and helps reviewers understand the effect of changes to the dags. These fixtures are rarely edited by hand; rather, the standard workflow is to delete all fixtures and allow the test suite to rebuild them. Changes to the fixtures are then reviewed and added to the commit using standard git tooling.

Fixtures can be rebuilt using the same docker image used to run tests in CI:

 find airflow/test/fixtures -name \*.expected -exec rm '{}' ';'
 docker run -it --rm -e REBUILD_FIXTURES=yes -e XDG_CACHE_HOME=/src/.cache \
   -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v $PWD:/src:rw --user $(id -u):$(id -g) \
   --entrypoint /usr/local/bin/tox \
   docker-registry.wikimedia.org/releng/tox-pyspark:0.6.0 \
   -e airflow -- -e pytest
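
After the rebuild, review the changed fixtures and stage them with standard git tooling, for example:

 git status airflow/test/fixtures
 git diff airflow/test/fixtures    # shows exactly which recorded commands changed
 git add airflow/test/fixtures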

How to prepare a deploy

Updating python scripts

Tasks defined as python scripts in the wikimedia/discovery/analytics repository require no special preparation prior to deployment. For the environments the scripts run within, see below.

Updating python environments

If any python environments have been updated, the artifacts/ directory needs to be updated as well. On a debian 9.12 instance (matching the yarn worker nodes):

 bin/make_wheels.sh
 bin/upload_wheels.py artifacts/*.whl
 git add artifacts/*.whl

The make_wheels.py script checks each directory in environments/ for requirements.txt. If present, all necessary wheels are collected into artifacts/ and a requirements-frozen.txt containing exact versioning is emitted.

The upload_wheels.py script will sync the wheels to archiva. If a wheel already exists in archiva for the same version but a different sha1, your local version will be replaced. Wheel upload requires setting up maven with WMF's Archiva configuration. This step can be run from any instance; it does not have the strict versioning requirement of the previous step.

Updating java jars

Tasks defined as java jars source their jar from the artifacts/ directory. To update a task to a new jar:

  1. Release a new version of the job jar to archiva (example).
  2. Copy {project}-{version}-jar-with-dependencies.jar from the job project target/ directory into the analytics project artifacts/ directory.
  3. Remove (git rm) the old version of the jar from the artifacts/ directory.
  4. Grep the repository (likely the airflow/dags or airflow/config directories) for references to the old jar version and update them to the new version.

Changes to the airflow dag will require rebuilding the fixtures. The jar copied into the repository will not be uploaded and must already exist in archiva. The local copy of the file will be used to derive the hash included in the commit. There are no CI procedures to ensure a jar added in this way actually exists in archiva. To verify manually, create a new clone of the repository:

 git clone . /tmp/deleteme

pull the artifacts:

 cd /tmp/deleteme
 git fat init && git fat pull
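
If the jar really is in archiva, git-fat replaces its small placeholder stub with the full file, so a quick size check is enough to confirm (the glob below is only illustrative):

 ls -lh artifacts/*-jar-with-dependencies.jar
 # A file of only a few bytes means the artifact could not be fetched from archiva.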

Updating airflow dags

No special preparation is required when deploying dag updates. Newly created dags will start in the off position; updated dags will pick up their changes within a minute of deployment.

How to deploy

  1. Ssh to deployment.eqiad.wmnet
  2. Run:
    cd /srv/deployment/wikimedia/discovery/analytics
    git fetch
    git log HEAD..origin/master
    git pull --ff-only origin master
    scap deploy 'a logged description of deployment'

    This part brings the discovery analytics code from gerrit to stat1007 and an-airflow1001. It additionally builds the necessary virtualenvs. Airflow DAGs are automatically loaded on deployment, but new dags start in the off position.

Airflow Production Deployment

Airflow dags are scheduled from an-airflow1001. After scap deployment airflow will see the updated dags within one minute.

WARNING: If any python in the airflow/plugins directory is updated by the deployment, the airflow-scheduler and airflow-webserver systemd services will need to be restarted to ensure all systems use the updated plugin code.

Airflow Test Deployment

Airflow test deployments are done by creating a python virtualenv on an analytics network machine (e.g. stat1006.eqiad.wmnet) within the tester's home directory. All DAG input/output paths are controlled by variables in the airflow/config directory. These variables are automatically deployed during scap deployment, but will not be automatically available in a test deployment. Variables can be adjusted to appropriate test paths and imported into the test instance to allow performing test invocations.
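
A rough sketch of such a setup; the package versions, variables file name, and exact airflow CLI syntax depend on the airflow version in use and are only illustrative here:

 python3 -m venv ~/airflow-test
 source ~/airflow-test/bin/activate
 pip install apache-airflow    # match the version pinned by search/airflow
 export AIRFLOW_HOME=~/airflow-test
 # Point dags_folder at a checkout of wikimedia/discovery/analytics (airflow/dags),
 # copy the variables from airflow/config, adjust paths to /user/$USER locations,
 # then import them into the test instance:
 airflow variables -i my_test_variables.json    # on airflow >= 2.0: airflow variables import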

Alternatively, many airflow tasks result in essentially executing a shell command (spark-submit, hive, skein, etc.) and waiting for it to complete. When possible, prefer testing those shell commands directly with modified command lines over performing an airflow test deployment. Deeper or more complex (mjolnir) dags may require a test deployment due to the number of tasks involved.
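
One way to find the exact command a task would run is to read it out of the recorded test fixtures (the fixture path below is hypothetical):

 grep -rl spark-submit airflow/test/fixtures | head
 cat airflow/test/fixtures/<some_dag>/<some_task>.expected
 # Re-run the recorded command by hand, overriding inputs/outputs to paths
 # under /user/$USER so that no production tables or paths are touched.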

Oozie Production Deployment

Oozie jobs are deployed from stat1007. The following environment variables are used to kick off all jobs:

  • REFINERY_VERSION should be set to the concrete, 'deployed' version of refinery that you want to deploy from. Like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh).
  • DISCOVERY_VERSION should be set to the concrete, 'deployed' version of discovery analytics that you want to deploy from. Like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh).
  • PROPERTIES_FILE should be set to the properties file that you want to deploy; relative to the refinery root. Like oozie/popularity_score/bundle.properties.
  • START_TIME should denote the time the job should run the first time. Like 2016-01-05T11:00Z. This should be coordinated between both the popularity_score and transfer_to_es jobs so that they are asking for the same days. Generally you want to set this to the next day the job should run.

kill existing coordinator / bundle

Make sure you kill any existing coordinator / bundle before deploying a new one.

The simplest way is to search for analytics-search in https://hue.wikimedia.org/oozie/list_oozie_coordinators/ and https://hue.wikimedia.org/oozie/list_oozie_bundles/. Jobs running under your own user can be killed from the Hue interface. For jobs running under another user, such as analytics-search, you may need to copy the appropriate id from hue and run the following on stat1007 (after replacing the last argument with the id):

 sudo -u analytics-search oozie job -oozie $OOZIE_URL -kill 0000520-160202151345641-oozie-oozi-C

query_clicks

This job has 2 coordinators, an hourly one and a daily one. They are deployed independently. The analytics-search user does not currently have the right access to webrequest data; that will need to be fixed before a prod deployment.

DANGER: Double check that the namespace_map_snapshot_id still exists. If the snapshot has been deleted then the hourly jobs will not have any results.
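
One way to double check from a stat host (with your kerberos credentials), assuming the namespace map is read from the wmf_raw.mediawiki_project_namespace_map hive table (adjust if the coordinator sources it from elsewhere):

 hive -e "SHOW PARTITIONS wmf_raw.mediawiki_project_namespace_map" | grep 2019-11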

 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | grep -v -- -dirty | tail -n 1 | sed 's/^.*\///')
 export START_TIME=2016-01-05T11:00Z
 
 cd /srv/deployment/wikimedia/discovery/analytics
 for PROPERTIES_TYPE in hourly daily; do
   export PROPERTIES_FILE="oozie/query_clicks/${PROPERTIES_TYPE}/coordinator.properties"
   sudo -u analytics-search \
     kerberos-run-command analytics-search \
     oozie job \
     -oozie $OOZIE_URL \
     -run \
     -config $PROPERTIES_FILE \
     -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
     -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
     -D queue_name=default \
     -D start_time=$START_TIME \
     -D namespace_map_snapshot_id=2019-11 \
     -D send_error_email_address=discovery-alerts@lists.wikimedia.org
 done

Oozie Test Deployments

There is no hadoop cluster in beta cluster or labs, so changes have to be tested in production. When submitting a job please ensure you override all appropriate values so the production data paths and tables are not affected. After testing your job, be sure to kill it (the correct one!) from hue. Note that most of the time you won't need to do a full test through oozie; you can instead call the script directly with spark-submit.

deploy test code to hdfs

 git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
 <copy some command from the gerrit ui to pull down and checkout your patch>
 ~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run
submitting test coordinator/workflow

When submitting a coordinator or workflow with the oozie command be sure to provide appropriate paths, such as:

 -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie
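
For example, a full test submission for a single coordinator might look like the following. The properties file, start time, and overrides are illustrative; pick the ones relevant to the job under test and point any job specific output tables or data directories at locations you own.

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/query_clicks/hourly/coordinator.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie $OOZIE_URL \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D namespace_map_snapshot_id=2019-11 \
           -D send_error_email_address=<your own address>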

Debugging

Web Based

Hue

Active oozie bundles, coordinators, and workflows can be seen in hue. Hue uses wikimedia LDAP for authentication. All production jobs for discovery are owned by the analytics-search user. This is only useful for viewing the state of things; actual manipulations need to be done by sudo'ing to the analytics-search user on stat1007.eqiad.wmnet and utilizing the CLI tools.

Container killed?

To find the logs about a killed container, specifically for a task that runs a hive script:

  • Find the Coordinator
  • Click on failed task
  • Click 'actions'
  • Click the external id of the hive task
  • Click the 'logs' button under 'Attempts' (not in the left sidebar)
  • Select 'syslog'
  • Search the page for 'beyond'. This may show a log line like:

2017-03-01 00:05:57,469 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1488294419903_1313_m_000000_0: Container [pid=2002,containerID=container_e42_1488294419903_1313_01_000003] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 5.0 GB of 2.1 GB virtual memory used. Killing container.

Yarn

Yarn shows the status of currently running jobs on the hadoop cluster. If you are in the wmf LDAP group (all WMF employees/contractors) you can log in directly to yarn.wikimedia.org. Yarn is also accessible over a SOCKS proxy to analytics1001.eqiad.wmnet.

Currently running jobs are visible via the running applications page.

CLI

CLI commands are typically run by ssh'ing into stat1007.eqiad.wmnet

Oozie

The oozie CLI command can be used to get info about currently running bundles, coordinators, and workflows, although it is often easier to get those from Hue (see above). Use oozie help for more detailed info; here are a few useful commands. Get the appropriate oozie id from hue.

Re-run a failed workflow
 sudo -u analytics-search oozie job -oozie $OOZIE_URL -rerun 0000612-160202151345641-oozie-oozi-W
Show info about running job
 oozie job -info 0000612-160202151345641-oozie-oozi-W

This will output something like the following

 Job ID : 0000612-160202151345641-oozie-oozi-W
 ------------------------------------------------------------------------------------------------------------------------------------
 Workflow Name : discovery-transfer_to_es-discovery.popularity_score-2016,2,1->http://elastic1017.eqiad.wmnet:9200-wf
 App Path      : hdfs://analytics-hadoop/wmf/discovery/2016-02-02T21.16.44Z--2b630f1/oozie/transfer_to_es/workflow.xml
 Status        : RUNNING
 Run           : 0
 User          : analytics-search
 Group         : -
 Created       : 2016-02-02 23:01 GMT
 Started       : 2016-02-02 23:01 GMT
 Last Modified : 2016-02-03 03:36 GMT
 Ended         : -
 CoordAction ID: 0000611-160202151345641-oozie-oozi-C@1
 Actions
 ------------------------------------------------------------------------------------------------------------------------------------
 ID                                                                            Status    Ext ID                 Ext Status Err Code  
 ------------------------------------------------------------------------------------------------------------------------------------
 0000612-160202151345641-oozie-oozi-W@:start:                                  OK        -                      OK         -         
 ------------------------------------------------------------------------------------------------------------------------------------
 0000612-160202151345641-oozie-oozi-W@transfer                                 RUNNING   job_1454006297742_16752RUNNING    -         
 ------------------------------------------------------------------------------------------------------------------------------------

You can continue down the rabbit hole to find more information. In this case the above oozie workflow kicked off a transfer job, which matches an element of the related workflow.xml.

 oozie job -info 0000612-160202151345641-oozie-oozi-W@transfer

This will output something like the following

 ID : 0000612-160202151345641-oozie-oozi-W@transfer
 ------------------------------------------------------------------------------------------------------------------------------------
 Console URL       : http://analytics1001.eqiad.wmnet:8088/proxy/application_1454006297742_16752/
 Error Code        : -
 Error Message     : -
 External ID       : job_1454006297742_16752
 External Status   : RUNNING
 Name              : transfer
 Retries           : 0
 Tracker URI       : resourcemanager.analytics.eqiad.wmnet:8032
 Type              : spark
 Started           : 2016-02-02 23:01 GMT
 Status            : RUNNING
 Ended             : -
 ------------------------------------------------------------------------------------------------------------------------------------

Of particular interest is the Console URL, and the related application id (application_1454006297742_16752). This id can be used with the yarn command to retrieve the logs of the application that was run. Note though that two separate jobs are created: the @transfer job shown above is an oozie runner, which for spark jobs does little to no actual work and mostly just kicks off another job to run the spark application.

List jobs by user

Don't attempt to use the oozie jobs -filter ... command; it will stall out oozie. Instead use hue and the filters available there.

Yarn

Yarn is the actual job runner.

List running spark jobs
 yarn application -appTypes SPARK -list
Fetch application logs

The yarn application id can be used to fetch application logs:

 sudo -u analytics-search yarn logs -applicationId application_1454006297742_16752 | less

Remember that each spark job kicked off by oozie has two application IDs. The one reported by oozie job -info is the oozie job runner. The ID of the sub job that actually runs the spark application isn't as easy to find. Your best bet is often to poke around in the yarn web ui to find the job with the right name, such as Discovery Transfer To http://elastic1017.eqiad.wmnet:9200.
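
If you would rather stay on the CLI, one way to narrow it down is to list the running spark applications and grep for the job name (the name fragment below is only an example):

 yarn application -list -appTypes SPARK 2>/dev/null | grep -i 'Discovery Transfer'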

Yarn manager logs

When an executor is lost, it's typically because it ran over its memory allocation. You can verify this by looking at the yarn manager logs for the machine that lost an executor. To fetch the logs, pull them down from stat1007 and grep them for the numeric portion of the application id. For an executor lost on analytics1056 for application_1454006297742_27819:

   curl -s http://analytics1056.eqiad.wmnet:8042/logs/yarn-yarn-nodemanager-analytics1056.log | grep 1454006297742_27819 | grep WARN

It might output something like this:

   2016-02-07 06:05:06,480 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_e12_1454006297742_27819_01_000020
   2016-02-07 06:26:57,151 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_e12_1454006297742_27819_01_000011 has processes older than 1 iteration running over the configured limit. Limit=5368709120, current usage = 5420392448
   2016-02-07 06:26:57,152 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=112313,containerID=container_e12_1454006297742_27819_01_000011] is running beyond physical memory limits. Current usage: 5.0 GB of 5 GB physical memory used; 5.7 GB of 10.5 GB virtual memory used. Killing container.
   2016-02-07 06:26:57,160 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e12_1454006297742_27819_01_000011 is : 143