
Discovery/Analytics

Revision as of 05:43, 23 January 2016 by imported>EBernhardson (→‎transfer_to_es)

Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository.

How to deploy

  1. SSH into tin (the deployment server)
  2. Run:
    cd /srv/deployment/wikimedia/discovery/analytics
    git deploy start
    git checkout master
    git pull
    git deploy sync

    (git deploy sync will complain that only “2/3 minions completed fetch”; you can answer “y” to that.)

    This part brings the discovery analytics code from gerrit to stat1002.
  3. SSH into stat1002
  4. Run sudo -u analytics-search /srv/deployment/wikimedia/discovery/analytics/bin/discovery-deploy-to-hdfs --verbose --no-dry-run

    This part copies the discovery analytics code to HDFS (but it does not resubmit Oozie jobs).
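Each deploy lands in its own timestamped directory on HDFS, and the production-job instructions below pick the newest one with sort | tail. A minimal local sketch of that pattern (the directory names are example version strings taken from this page; the temporary directory stands in for /mnt/hdfs/wmf/discovery/):

```shell
#!/bin/sh
# Simulate two deploy directories named <UTC timestamp>--<git short sha>,
# the naming scheme used under /wmf/discovery/ and /wmf/refinery/.
d=$(mktemp -d)
mkdir "$d/2015-01-05T17.59.18Z--7bb7f07" "$d/2016-01-22T20.19.59Z--e00dbef"
# ISO-8601 timestamps sort lexicographically in chronological order,
# so `sort | tail -n 1` picks the most recent deploy.
latest=$(ls "$d" | sort | tail -n 1)
echo "$latest"   # 2016-01-22T20.19.59Z--e00dbef
rm -rf "$d"
```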

How to deploy Oozie production jobs

On stat1002, run

 export DISCOVERY_VERSION=$(basename $(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1))
 export REFINERY_VERSION=$(basename $(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1))
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 export START_TIME=2016-01-05T11:00Z
 
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 sudo -u analytics-search oozie job \
   -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
   -run \
   -config $PROPERTIES_FILE \
   -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
   -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
   -D queue_name=production \
   -D start_time=$START_TIME

where:

  • REFINERY_VERSION should be set to the concrete, deployed version of refinery that you want to run against, like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh.)
  • DISCOVERY_VERSION should be set to the concrete, deployed version of discovery analytics that you want to run against, like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh.)
  • PROPERTIES_FILE should be set to the properties file that you want to deploy, relative to the discovery analytics root. Like oozie/popularity_score/bundle.properties.
  • START_TIME should denote the time the job should run the first time. Like 2016-01-05T11:00Z.
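The version strings above encode both the deploy time and the git short sha of the deployed commit. If you need to map a deployed directory back to a commit, a small sketch using plain shell parameter expansion (the version string is the example from this page):

```shell
#!/bin/sh
# Deployed versions are named <UTC timestamp>--<git short sha>
version='2016-01-22T20.19.59Z--e00dbef'
stamp=${version%%--*}   # everything before the first '--' (the deploy time)
sha=${version##*--}     # everything after the last '--' (the git short sha)
echo "$stamp"   # 2016-01-22T20.19.59Z
echo "$sha"     # e00dbef
```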

Oozie Test Deployments

There is no Hadoop cluster in the beta cluster or labs, so changes have to be tested in production. When submitting a job, please ensure you override all appropriate values so the production data paths and tables are not affected. After testing your job, be sure to kill it (the correct one!) from Hue. Note that most of the time you won't need a full test through Oozie; you can instead call the script directly with spark-submit.

deploy test code to hdfs

 git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
 <copy some command from the gerrit ui to pull down and checkout your patch>
 ~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run

popularity_score

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D popularity_score_table=$USER.discovery_popularity_score

transfer_to_es

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D elasticsearch_url=http://stat1002.eqiad.wmnet:9876 \
           -D spark_number_executors=3 \
           -D popularity_score_table=$USER.discovery_popularity_score \
           -D oozie.bundle.application.path= \
           -D oozie.coord.application.path=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml
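After a test run, the submitted job should be killed. Besides Hue, this can also be done from the Oozie CLI; a sketch, where the job id is a hypothetical placeholder (take the real id from the submission output or from Hue):

```shell
#!/bin/sh
OOZIE_URL=http://analytics1027.eqiad.wmnet:11000/oozie
JOB_ID=0000123-160122000000000-oozie-oozi-C   # hypothetical placeholder id
# Echoed rather than executed here; drop the leading `echo` to actually kill the job.
echo oozie job -oozie "$OOZIE_URL" -kill "$JOB_ID"
```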