Discovery/Analytics

Revision as of 23:14, 2 February 2016

Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository.

How to deploy

  1. SSH into tin
  2. Run:
    cd /srv/deployment/wikimedia/discovery/analytics
    git deploy start
    git checkout master
    git pull
    git deploy sync

    (git deploy sync may complain that only “2/3 minions completed fetch”; answer “y” to continue.)

    This brings the discovery analytics code from Gerrit to stat1002.
  3. SSH into stat1002
  4. Run sudo -u analytics-search /srv/deployment/wikimedia/discovery/analytics/bin/discovery-deploy-to-hdfs --verbose --no-dry-run

    This copies the discovery analytics code to HDFS (but it does not resubmit Oozie jobs).
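
To confirm the deploy reached HDFS, you can list the deployed versions through the fuse mount. A quick check, assuming the /mnt/hdfs mount used in the examples below:

 # the newest entry should be the version you just deployed
 ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1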

How to deploy Oozie production jobs

Oozie jobs are deployed from stat1002. The following environment variables are used to kick off all jobs:

  • REFINERY_VERSION should be set to the concrete, 'deployed' version of refinery that you want to deploy from. Like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh).
  • DISCOVERY_VERSION should be set to the concrete, 'deployed' version of discovery analytics that you want to deploy from. Like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh).
  • PROPERTIES_FILE should be set to the properties file that you want to deploy, relative to the discovery analytics root. Like oozie/popularity_score/coordinator.properties.
  • START_TIME should denote the time the job should run for the first time. Like 2016-01-05T11:00Z. This should be coordinated between the popularity_score and transfer_to_es jobs so that they ask for the same days. Generally you want to set this to the next day the job should run.
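
Before submitting, it can help to sanity-check that the versions you picked exist as deployed directories. A minimal sketch using the /mnt/hdfs fuse mount:

 # each should print a directory path; an error means the version string is wrong
 ls -d /mnt/hdfs/wmf/refinery/$REFINERY_VERSION
 ls -d /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION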

popularity_score

 # pick the newest deployed version by timestamp
 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 export START_TIME=2016-01-05T11:00Z
 
 # submit from the deployed directory so the relative $PROPERTIES_FILE path resolves
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 sudo -u analytics-search oozie job \
   -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
   -run \
   -config $PROPERTIES_FILE \
   -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
   -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
   -D queue_name=production \
   -D start_time=$START_TIME
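
Submitting prints a job id, which you can use to check on the coordinator with the standard Oozie CLI:

 # <job-id> is printed by the submit command above
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -info <job-id>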

transfer_to_es

The firewall between analytics and codfw is not yet open, so the property overrides below run the bundle as a single coordinator instead.

 export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
 export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
 export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
 export START_TIME=2016-01-05T11:00Z
 
 cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
 # the bundle path is blanked below so only the coordinator is submitted
 sudo -u analytics-search oozie job \
   -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
   -run \
   -config $PROPERTIES_FILE \
   -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
   -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
   -D queue_name=production \
   -D start_time=$START_TIME \
   -D oozie.bundle.application.path= \
   -D oozie.coord.application.path=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml \
   -D elasticsearch_url=http://elastic1017.eqiad.wmnet:9200
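
If the transfer fails right away, it can be worth confirming that the elasticsearch endpoint is reachable from the analytics network. A quick check, not part of the documented procedure:

 # prints the cluster banner JSON when the host is reachable
 curl -s http://elastic1017.eqiad.wmnet:9200/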

Oozie Test Deployments

There is no Hadoop cluster in beta cluster or labs, so changes have to be tested in production. When submitting a job, please ensure you override all appropriate values so that the production data paths and tables are not affected. After testing your job, be sure to kill it (the correct one!) from Hue. Note that most of the time you won't need a full test through Oozie; you can instead call the script directly with spark-submit, as sketched below.
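
For that common case, a spark-submit invocation looks roughly like the following. The driver script path is hypothetical; substitute the actual script from the repository:

 # hypothetical script path; --num-executors mirrors the spark_number_executors override below
 spark-submit --master yarn --num-executors 3 \
     ~/discovery-analytics/oozie/popularity_score/popularity_score.py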

Deploy test code to HDFS

 git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
 <copy some command from the gerrit ui to pull down and checkout your patch>
 ~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run
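
Afterwards the code should be visible under your user directory via the same fuse mount:

 # expect 'current' plus one timestamped directory per deploy
 ls /mnt/hdfs/user/$USER/discovery-analytics/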

popularity_score

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D popularity_score_table=$USER.discovery_popularity_score
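
Once a run completes, you can spot-check the table it wrote (the name matches the popularity_score_table override above):

 # a few rows are enough to confirm the job produced data
 hive -e "SELECT * FROM $USER.discovery_popularity_score LIMIT 10"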

transfer_to_es

 export DISCOVERY_VERSION=current
 export REFINERY_VERSION=current
 export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
 cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
           -run \
           -config $PROPERTIES_FILE \
           -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
           -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
           -D start_time=2016-01-22T00:00Z \
           -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
           -D elasticsearch_url=http://stat1002.eqiad.wmnet:9876 \
           -D spark_number_executors=3 \
           -D popularity_score_table=$USER.discovery_popularity_score \
           -D oozie.bundle.application.path= \
           -D oozie.coord.application.path=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml
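
When you are done testing, remember to kill the job. Besides Hue, the Oozie CLI works as well; double-check that the id belongs to your test job and not a production one:

 # <job-id> as printed at submit time, or as shown in Hue
 oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -kill <job-id>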