Discovery/Analytics
Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository.
How to deploy
- SSH into tin.
- Run:
cd /srv/deployment/wikimedia/discovery/analytics
git deploy start
git checkout master
git pull
git deploy sync
- (git deploy sync will complain that only “2/3 minions completed fetch”. You can answer “y”es to that.)
- This part brings the discovery analytics code from gerrit to stat1002.
- SSH into stat1002.
- Run:
sudo -u analytics-search /srv/deployment/wikimedia/discovery/analytics/bin/discovery-deploy-to-hdfs --verbose --no-dry-run
- This part brings the discovery analytics code to HDFS (it does not resubmit Oozie jobs).
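As a quick sanity check (assuming the HDFS fuse mount at /mnt/hdfs, which the commands below also rely on), the newest timestamped directory should match the version you just deployed:
ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1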
How to deploy Oozie production jobs
Oozie jobs are deployed from stat1002. The following environment variables are used to kick off all jobs:
- REFINERY_VERSION should be set to the concrete, deployed version of refinery that you want to deploy from, like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh.)
- DISCOVERY_VERSION should be set to the concrete, deployed version of discovery analytics that you want to deploy from, like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh.)
- PROPERTIES_FILE should be set to the properties file that you want to deploy, relative to the discovery analytics root, like oozie/popularity_score/bundle.properties.
- START_TIME should denote the time the job should run the first time, like 2016-01-05T11:00Z. This should be coordinated between the popularity_score and transfer_to_es jobs so that they ask for the same days. Generally you want to set this to the next day the job should run.
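For example, the newest deployed versions can be extracted from the deploy directories like this (the production commands below use the same pattern):
export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')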
popularity_score
export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
export START_TIME=2016-01-05T11:00Z
cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
sudo -u analytics-search oozie job \
  -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
  -run \
  -config $PROPERTIES_FILE \
  -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
  -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
  -D queue_name=production \
  -D start_time=$START_TIME
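Submitting prints a job id. A quick way to confirm the job is running is the Oozie CLI (Hue works too; the id below is a placeholder):
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -info <job-id>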
transfer_to_es
The firewall between analytics and codfw has not yet been opened, so the following adjusts the properties to run the bundle as a single coordinator:
export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
export START_TIME=2016-01-05T11:00Z
cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
sudo -u analytics-search oozie job \
  -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
  -run \
  -config $PROPERTIES_FILE \
  -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
  -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
  -D queue_name=production \
  -D start_time=$START_TIME \
  -D oozie.bundle.application.path= \
  -D oozie.coord.application.path=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml \
  -D elasticsearch_url=http://elastic1017.eqiad.wmnet:9200
Oozie Test Deployments
There is no Hadoop cluster in beta cluster or labs, so changes have to be tested in production. When submitting a job, please ensure you override all appropriate values so the production data paths and tables are not affected. After testing your job, be sure to kill it (the correct one!) from Hue. Note that most of the time you won't need to do a full test through Oozie; you can instead call the script directly with spark-submit (see the sketch below).
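A rough sketch of such a direct run, assuming a checkout of the repository at ~/discovery-analytics as created in the next section. The script path and arguments here are hypothetical, not confirmed; check the oozie/ directory of the repository for the real ones:
# Hypothetical sketch only: verify the actual script name and its arguments
# in the wikimedia/discovery/analytics repository before running.
spark-submit \
  --master yarn \
  ~/discovery-analytics/oozie/popularity_score/popularity_score.py \
  <script arguments>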
deploy test code to hdfs
git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
<copy some command from the gerrit ui to pull down and checkout your patch>
~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run
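To confirm the test copy landed under your user directory (a quick check, assuming the Hadoop client tools are available on stat1002):
hdfs dfs -ls /user/$USER/discovery-analytics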
popularity_score
export DISCOVERY_VERSION=current
export REFINERY_VERSION=current
export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
  -run \
  -config $PROPERTIES_FILE \
  -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
  -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
  -D start_time=2016-01-22T00:00Z \
  -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
  -D popularity_score_table=$USER.discovery_popularity_score
transfer_to_es
export DISCOVERY_VERSION=current
export REFINERY_VERSION=current
export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
  -run \
  -config $PROPERTIES_FILE \
  -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
  -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
  -D start_time=2016-01-22T00:00Z \
  -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
  -D elasticsearch_url=http://stat1002.eqiad.wmnet:9876 \
  -D spark_number_executors=3 \
  -D popularity_score_table=$USER.discovery_popularity_score \
  -D oozie.bundle.application.path= \
  -D oozie.coord.application.path=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml
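Once you are done testing, kill the job from Hue, or from the Oozie CLI (double-check the id is your test job, not the production one):
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -kill <job-id>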