Analytics/Systems/ua-parser
Revision as of 13:52, 13 September 2019
This page describes our setup for ua-parser, a library used in Java and Python to parse user-agent strings into more meaningful values.
The ua-parser project uses a core repository, uap-core, for shared regular expressions and test data, plus per-language repositories (uap-java and uap-python, for instance) that reference the core repository as a git submodule.
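The division of labor between uap-core and the language bindings can be illustrated with a small pure-Python sketch (this is not the real ua-parser API): uap-core ships an ordered list of regular expressions with associated replacement values, and each language library simply applies them in order until one matches. The two regex entries below are simplified stand-ins, not actual uap-core definitions.

```python
import re

# Simplified stand-ins for entries in uap-core's regexes.yaml
# (the real file has hundreds of ordered entries).
UA_REGEXES = [
    {"regex": r"Firefox/(\d+)\.(\d+)", "family": "Firefox"},
    {"regex": r"Chrome/(\d+)\.(\d+)", "family": "Chrome"},
]

def parse_family(user_agent: str) -> dict:
    """Return the first matching browser family and version, in the spirit of ua-parser."""
    for entry in UA_REGEXES:
        m = re.search(entry["regex"], user_agent)
        if m:
            return {"family": entry["family"], "major": m.group(1), "minor": m.group(2)}
    # Like ua-parser, fall back to "Other" when nothing matches
    return {"family": "Other", "major": None, "minor": None}

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
print(parse_family(ua))
```

Because the regex list lives in one shared repository, fixing a misidentified browser there benefits the Java and Python libraries alike, which is why the team tracks upstream uap-core rather than forking it.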
As of 2019-09-13, the Analytics team maintains two forks:
- https://gerrit.wikimedia.org/r/#/admin/projects/analytics/ua-parser/uap-java for java
- https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-ua-parser for python
It is an explicit choice not to maintain a fork of the uap-core repository: we aim to always use an upstream version of the regular-expression definitions, and submit pull requests upstream as needed.
Each of the two forks we maintain has a dedicated branch carrying patches needed for WMF artifact releases:
- wmf branch (java repository) - Contains mostly updates to pom.xml, and sometimes functional changes not yet merged upstream. Jar files are generated from this branch and uploaded to archiva.wikimedia.org.
- debian branch (python repository) - Contains patches that allow building Debian packages from the code, which are then uploaded to apt.wikimedia.org.
How to update
Measure the change
Before updating the code in production, it is good practice to look at the impact the change will have on the data. To do so, we use the Hadoop cluster to generate a temporary table containing both the current and the new versions of the parsed user-agent data, and compare them. Below is a rough procedure.
- Update your local copy of the uap-java code (including pulling the latest version of master in the uap-core submodule), build and install the uap-java jar locally, and use it to build an updated refinery-hive jar:
```shell
# In the cloned uap-java repo, assuming you have set up a remote
# to the original github repo named github
git fetch --all
git checkout github/master
cd uap-core
git fetch --all
git checkout master
git pull
cd ..

# Build and install the uap-java jar locally
mvn clean install

# Move to the refinery-source folder
cd /my/refinery-source

# Update refinery-source/pom.xml to reference your locally installed new uap-java jar,
# then build the refinery-hive jar (possibly skipping tests since ua-parser has changed: -DskipTests)
mvn -pl refinery-hive -am clean package
```
- Generate the comparison Hive table using the refinery-hive jar built above (taking a 1/64 sample of one day of webrequest data):
```sql
-- In hive
USE MYDATABASE;
ADD JAR /PATH/TO/MY/JAR/refinery-hive-0.0.MYJARVERSION-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION ua_parser AS 'org.wikimedia.analytics.refinery.hive.UAParserUDF';

DROP TABLE IF EXISTS tmp_ua_check_YYYY_MM_DD;
CREATE TABLE tmp_ua_check_YYYY_MM_DD STORED AS PARQUET AS
SELECT
    user_agent,
    user_agent_map AS user_agent_map_original,
    ua_parser(user_agent) AS user_agent_map_new,
    COUNT(1) AS requests
FROM wmf.webrequest webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON hostname, sequence)
WHERE year = YYYY AND month = MM AND day = DD
GROUP BY user_agent, user_agent_map, ua_parser(user_agent);
```
- Use the comparison table to measure differences
- Document the results
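The per-field differences can be tallied either directly in Hive or, since the aggregated comparison table is small, by pulling it into Python. Below is a hedged sketch of the latter: the rows mirror the (user_agent_map_original, user_agent_map_new, requests) columns of the table above, but the row values and field names here are fabricated for illustration, and the request-weighting is an assumption about how one would want to measure impact.

```python
from collections import Counter

# Each tuple mirrors a row of the tmp_ua_check table:
# (user_agent_map_original, user_agent_map_new, requests).
# These example rows are made up for illustration.
rows = [
    ({"browser_family": "Chrome", "os_family": "Windows"},
     {"browser_family": "Chrome", "os_family": "Windows"}, 900),
    ({"browser_family": "Other", "os_family": "Other"},
     {"browser_family": "Edge", "os_family": "Windows"}, 100),
]

def change_rates(rows):
    """Request-weighted fraction of traffic whose parsed value changes, per field."""
    total = sum(req for _, _, req in rows)
    changed = Counter()
    for old, new, req in rows:
        for field in old:
            if old[field] != new.get(field):
                changed[field] += req
    return {field: changed[field] / total for field in changed}

print(change_rates(rows))
```

Weighting by request counts rather than by distinct user-agent strings keeps a long tail of rare, oddly-formatted user agents from dominating the comparison.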
Update the code for production
- Update the Java and Python repositories to the needed commit (usually either a released tag or the current master), and update their uap-core submodule to the correct version (usually the current master).
- Rebase or clean up the wmf and debian branches using the updated master branches of the Java and Python repositories. How much work this is depends on the changes those branches contain compared to what has been merged upstream.
- Push new patches to the wmf and debian branches, at least so that a new version of the jar and of the Debian package are created.
- Build and release the new uap-java jar to archiva, and the new Debian package to apt (ask Andrew or Luca).