Discovery/Analytics
Discovery uses the Analytics Cluster to support CirrusSearch. The source code is in the wikimedia/discovery/analytics repository.
How to deploy
- Ssh into tin
- Run:
  cd /srv/deployment/wikimedia/discovery/analytics
  git deploy start
  git checkout master
  git pull
  git deploy sync
  (git deploy sync will complain that only "2/3 minions completed fetch"; you can answer "y" to that.) This part brings the discovery analytics code from gerrit to stat1002.
- Ssh into stat1002
- Run:
  sudo -u analytics-search /srv/deployment/wikimedia/discovery/analytics/bin/discovery-deploy-to-hdfs --verbose --no-dry-run
  This part brings the code to HDFS (but it does not resubmit Oozie jobs).
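To confirm the deploy landed, you can list the versioned deploy directory on HDFS; the newest entry should match the version you just deployed:
hdfs dfs -ls /wmf/discovery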
How to deploy Oozie production jobs
Oozie jobs are deployed from stat1002. The following environment variables are used to kick off all jobs:
- REFINERY_VERSION should be set to the concrete, 'deployed' version of refinery that you want to deploy from, like 2015-01-05T17.59.18Z--7bb7f07. (Do not use current there, or your job is likely to break when someone deploys refinery afresh.)
- DISCOVERY_VERSION should be set to the concrete, 'deployed' version of discovery analytics that you want to deploy from, like 2016-01-22T20.19.59Z--e00dbef. (Do not use current there, or your job is likely to break when someone deploys discovery analytics afresh.)
- PROPERTIES_FILE should be set to the properties file that you want to deploy, relative to the discovery analytics root, like oozie/popularity_score/bundle.properties.
- START_TIME should denote the time the job should run the first time, like 2016-01-05T11:00Z. This should be coordinated between the popularity_score and transfer_to_es jobs so that they are asking for the same days. Generally you want to set this to the next day the job should run.
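For example, to pick a concrete version, list the deployed directories via the HDFS fuse mount and take one of the newest entries (the commands below use the same trick to grab the latest version automatically):
ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 3
ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 3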
popularity_score
export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
export START_TIME=2016-01-05T11:00Z
cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
sudo -u analytics-search oozie job \
    -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
    -run \
    -config $PROPERTIES_FILE \
    -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
    -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
    -D queue_name=production \
    -D start_time=$START_TIME
transfer_to_es
The firewall between analytics and codfw is not yet opened up, so this adjusts the properties to run the bundle as a coordinator.
export DISCOVERY_VERSION=$(ls -d /mnt/hdfs/wmf/discovery/20* | sort | tail -n 1 | sed 's/^.*\///')
export REFINERY_VERSION=$(ls -d /mnt/hdfs/wmf/refinery/20* | sort | tail -n 1 | sed 's/^.*\///')
export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
export START_TIME=2016-01-05T11:00Z
cd /mnt/hdfs/wmf/discovery/$DISCOVERY_VERSION
sudo -u analytics-search oozie job \
    -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
    -run \
    -config $PROPERTIES_FILE \
    -D discovery_oozie_directory=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie \
    -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
    -D queue_name=production \
    -D start_time=$START_TIME \
    -D oozie.bundle.application.path= \
    -D oozie.coord.application.path=hdfs://analytics-hadoop/wmf/discovery/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml \
    -D elasticsearch_url=http://elastic1017.eqiad.wmnet:9200
Oozie Test Deployments
There is no hadoop cluster in the beta cluster or labs, so changes have to be tested in production. When submitting a job, please ensure you override all appropriate values so the production data paths and tables are not affected. After testing your job, be sure to kill it (the correct one!) from hue. Note that most of the time you won't need to do a full test through oozie; you can instead call the script directly with spark-submit, as sketched below.
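As a rough illustration only (the script path here is hypothetical; check the repository for the real entry point and the options each script expects), a direct run looks something like:
spark-submit \
    --master yarn \
    --num-executors 3 \
    ~/discovery-analytics/oozie/popularity_score/popularityScore.py  # hypothetical path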
deploy test code to hdfs
git clone http://gerrit.wikimedia.org/r/wikimedia/discovery/analytics ~/discovery-analytics
<copy some command from the gerrit ui to pull down and checkout your patch>
~/discovery-analytics/bin/discovery-deploy-to-hdfs --base hdfs:///user/$USER/discovery-analytics --verbose --no-dry-run
popularity_score
export DISCOVERY_VERSION=current
export REFINERY_VERSION=current
export PROPERTIES_FILE=oozie/popularity_score/coordinator.properties
cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
    -run \
    -config $PROPERTIES_FILE \
    -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
    -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
    -D start_time=2016-01-22T00:00Z \
    -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
    -D popularity_score_table=$USER.discovery_popularity_score
transfer_to_es
export DISCOVERY_VERSION=current
export REFINERY_VERSION=current
export PROPERTIES_FILE=oozie/transfer_to_es/bundle.properties
cd /mnt/hdfs/user/$USER/discovery-analytics/$DISCOVERY_VERSION
oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie \
    -run \
    -config $PROPERTIES_FILE \
    -D discovery_oozie_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie \
    -D analytics_oozie_directory=hdfs://analytics-hadoop/wmf/refinery/$REFINERY_VERSION/oozie \
    -D start_time=2016-01-22T00:00Z \
    -D discovery_data_directory=hdfs://analytics-hadoop/user/$USER/discovery-analytics-data \
    -D elasticsearch_url=http://stat1002.eqiad.wmnet:9876 \
    -D spark_number_executors=3 \
    -D popularity_score_table=$USER.discovery_popularity_score \
    -D oozie.bundle.application.path= \
    -D oozie.coord.application.path=hdfs://analytics-hadoop/user/$USER/discovery-analytics/$DISCOVERY_VERSION/oozie/transfer_to_es/coordinator.xml
Debugging
Web Based
Hue
Active oozie bundles, coordinators, and workflows can be seen in hue. Hue uses wikimedia LDAP for authentication. All production jobs for discovery are owned by the analytics-search user. This is only useful for viewing the state of things; actual manipulations need to be done by sudo'ing to the analytics-search user on stat1002.eqiad.wmnet and utilizing the CLI tools.
Yarn
Yarn shows the status of currently running jobs on the hadoop cluster. Yarn is accessible over a SOCKS proxy to analytics1001.eqiad.wmnet. Currently running jobs are visible via the running applications page.
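A minimal sketch of setting that up, assuming you tunnel through a bastion host (the hostname below is an assumption; use whichever bastion you normally reach production through), then configure your browser to use localhost:8080 as a SOCKS5 proxy:
# Open a SOCKS proxy via a bastion; the hostname is an assumption, substitute your own.
ssh -N -D 8080 bast1001.wikimedia.org
# Then browse to the resource manager UI, e.g. http://analytics1001.eqiad.wmnet:8088/cluster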
CLI
CLI commands are typically run by ssh'ing into stat1002.eqiad.wmnet.
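For example, from your workstation:
ssh stat1002.eqiad.wmnet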
Oozie
The oozie CLI command can be used to get info about currently running bundles, coordinators, and workflows, although it is often easier to get those from #Hue. Use oozie help for more detailed info; here are a few useful commands. Get the appropriate oozie id from hue.
Re-run a failed workflow
sudo -u analytics-search oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -rerun 0000612-160202151345641-oozie-oozi-W
Show info about running job
oozie job -info 0000612-160202151345641-oozie-oozi-W
This will output something like the following
Job ID : 0000612-160202151345641-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : discovery-transfer_to_es-discovery.popularity_score-2016,2,1->http://elastic1017.eqiad.wmnet:9200-wf
App Path      : hdfs://analytics-hadoop/wmf/discovery/2016-02-02T21.16.44Z--2b630f1/oozie/transfer_to_es/workflow.xml
Status        : RUNNING
Run           : 0
User          : analytics-search
Group         : -
Created       : 2016-02-02 23:01 GMT
Started       : 2016-02-02 23:01 GMT
Last Modified : 2016-02-03 03:36 GMT
Ended         : -
CoordAction ID: 0000611-160202151345641-oozie-oozi-C@1
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                              Status     Ext ID                    Ext Status    Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000612-160202151345641-oozie-oozi-W@:start:    OK         -                         OK            -
------------------------------------------------------------------------------------------------------------------------------------
0000612-160202151345641-oozie-oozi-W@transfer   RUNNING    job_1454006297742_16752   RUNNING       -
------------------------------------------------------------------------------------------------------------------------------------
You can continue down the rabbit hole to find more information. In this case the above oozie workflow kicked off a transfer job, which matches an element of the related workflow.xml
oozie job -info 0000612-160202151345641-oozie-oozi-W@transfer
This will output something like the following
ID : 0000612-160202151345641-oozie-oozi-W@transfer
------------------------------------------------------------------------------------------------------------------------------------
Console URL     : http://analytics1001.eqiad.wmnet:8088/proxy/application_1454006297742_16752/
Error Code      : -
Error Message   : -
External ID     : job_1454006297742_16752
External Status : RUNNING
Name            : transfer
Retries         : 0
Tracker URI     : resourcemanager.analytics.eqiad.wmnet:8032
Type            : spark
Started         : 2016-02-02 23:01 GMT
Status          : RUNNING
Ended           : -
------------------------------------------------------------------------------------------------------------------------------------
Of particular interest are the Console URL and the related application id (application_1454006297742_16752). This id can be used with the yarn command to retrieve the logs of the application that was run. Note, though, that two separate jobs are created: the @transfer job shown above is an oozie runner, which for spark jobs does little to no actual work and mostly just kicks off another job to run the spark application.
List jobs by user
Don't attempt to use the oozie jobs -filter ... command; it will stall out oozie. Instead use hue and the filters available there.
Yarn
Yarn is the actual job runner.
List running spark jobs
yarn application -appTypes SPARK -list
Fetch application logs
The yarn application id can be used to fetch application logs:
sudo -u analytics-search yarn logs -applicationId application_1454006297742_16752 | less
Remember that each spark job kicked off by oozie has two application ids. The one reported by oozie job -info is the oozie job runner. The sub job that is actually running the spark application isn't as easy to find the ID of. Your best bet is often to poke around in the yarn web ui to find the job with the right name, such as Discovery Transfer To http://elastic1017.eqiad.wmnet:9200.
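From the CLI, a rough equivalent (a sketch; the grep pattern just needs to match enough of the application name) is to list running spark applications and grep for the job name:
yarn application -list -appTypes SPARK | grep 'Discovery Transfer'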
Yarn manager logs
When an executor is lost, it's typically because it ran over its memory allocation. You can verify this by looking at the yarn node manager logs for the machine that lost the executor. To fetch the logs, pull them down from stat1002 and grep them for the numeric portion of the application id. For an executor lost on analytics1056 for application_1454006297742_27819:
curl -s http://analytics1056.eqiad.wmnet:8042/logs/yarn-yarn-nodemanager-analytics1056.log | grep 1454006297742_27819 | grep WARN
It might output something like this:
2016-02-07 06:05:06,480 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_e12_1454006297742_27819_01_000020
2016-02-07 06:26:57,151 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_e12_1454006297742_27819_01_000011 has processes older than 1 iteration running over the configured limit. Limit=5368709120, current usage = 5420392448
2016-02-07 06:26:57,152 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=112313,containerID=container_e12_1454006297742_27819_01_000011] is running beyond physical memory limits. Current usage: 5.0 GB of 5 GB physical memory used; 5.7 GB of 10.5 GB virtual memory used. Killing container.
2016-02-07 06:26:57,160 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e12_1454006297742_27819_01_000011 is : 143
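When this is the failure mode, one common mitigation (a sketch, not project policy; the flag values are illustrative) is to rerun the spark job with more executor memory or more yarn memory overhead:
# Illustrative values only; tune to the job that was killed.
spark-submit \
    --master yarn \
    --executor-memory 6g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    <your-script-and-args>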