
Analytics/Systems/ua-parser

This page describes our setup for ua-parser, a library available in Java and Python that parses user-agent strings into structured values such as browser, OS, and device family.

Setup

The ua-parser project uses a core repository named uap-core for shared regular expressions and test data, plus one repository per programming language (for instance uap-java and uap-python) that references the core repository as a git submodule.
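
For a local checkout that mirrors this layout, clone one of the language repositories and initialize its uap-core submodule. Below is a minimal sketch using the upstream github URL; substitute the fork you actually work on:

    # Clone the java repository and initialize the uap-core submodule
    git clone https://github.com/ua-parser/uap-java.git
    cd uap-java
    git submodule update --init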

As of 2019-09-13, the Analytics team maintains two forks: one of uap-java and one of uap-python.

Each of the two forks has a dedicated branch carrying the patches needed for wmf artifact releases (see the sketch after the list):

  • the wmf branch of the java repository - Contains mostly updates to pom.xml and occasionally functional changes not yet merged upstream. Jar files are generated from this branch and uploaded to archiva.wikimedia.org.
  • the debian branch of the python repository - Contains the patches needed to build Debian packages from the code; the packages are then uploaded to apt.wikimedia.org.
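
To see which patches a release branch currently carries on top of upstream, compare it against the upstream master. A minimal sketch, assuming a remote named github pointing at the original repository (the same naming used in the build steps below):

    # In the uap-java fork: commits on wmf not yet in upstream master
    git fetch --all
    git log --oneline github/master..wmf
    # Same idea in the uap-python fork for the debian branch
    git log --oneline github/master..debian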

How to update

Measure the change

Before updating the code in production, it is good practice to look at the impact the change will have on the data. To do so, we use the Hadoop cluster to generate a temporary table containing both the current and the new version of the parsed user-agent data, and compare the two. Below is a rough procedure.

  1. Update your version of the uap-java code (including pulling the latest master in the uap-core submodule), build and install a local uap-java jar, and use it to build an updated refinery-hive jar
    # In the cloned uap-java repo, assuming you have set up a remote
    # named github pointing to the original github repo
    git fetch --all
    git checkout github/master
    cd uap-core
    git fetch --all
    git checkout master
    git pull
    cd ..
    # Building and installing the uap-java jar locally
    mvn clean install
    # Move to the refinery-source folder
    cd /my/refinery-source
    # Update the refinery-source/pom.xml to reference your locally installed new uap-java jar
    # Build the refinery-hive jar (tests may fail because ua-parser changed; add -DskipTests to skip them)
    mvn -pl refinery-hive -am clean package
    
  2. Generate the comparison Hive table using the refinery-hive jar built above (taking a 1/64 sample of one day of webrequest data)
    -- In hive
    use MYDATABASE;
    ADD JAR /PATH/TO/MY/JAR/refinery-hive-0.0.MYJARVERSION-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
    
    DROP TABLE IF EXISTS tmp_ua_check_YYYY_MM_DD;
    CREATE TABLE tmp_ua_check_YYYY_MM_DD STORED AS PARQUET AS
    SELECT
      user_agent,
      user_agent_map AS user_agent_map_original,
      ua_parser(user_agent) AS user_agent_map_new,
      COUNT(1) AS requests
    FROM wmf.webrequest TABLESAMPLE(BUCKET 1 OUT OF 64 ON hostname, sequence)
    WHERE year = YYYY AND month = MM AND day = DD
    GROUP BY user_agent, user_agent_map, ua_parser(user_agent);
    
  3. Use the comparison table to measure differences (see the query sketch after this list)
  4. Document the results
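
As an example of step 3, a query along the following lines surfaces the most frequent user-agents whose parsed browser family changed. This is a sketch: map keys such as browser_family are assumed from the webrequest user_agent_map schema.

    -- In hive, against the comparison table created above
    SELECT
      user_agent_map_original['browser_family'] AS browser_original,
      user_agent_map_new['browser_family'] AS browser_new,
      SUM(requests) AS requests
    FROM tmp_ua_check_YYYY_MM_DD
    WHERE user_agent_map_original['browser_family'] != user_agent_map_new['browser_family']
    GROUP BY
      user_agent_map_original['browser_family'],
      user_agent_map_new['browser_family']
    ORDER BY requests DESC
    LIMIT 50;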

Update the code for production

  1. Update the java and python repositories to the needed commit (usually either a released tag or current master), and update their uap-core submodule to the correct version (usually current master)
  2. Rebase or clean up the wmf and debian branches on top of the updated master branches of the java and python repositories. How to proceed depends on how the branches' changes compare to what has been merged upstream (see the sketch after this list).
  3. Push new patches to the wmf and debian branches, at least so that a new version of the jar and of the Debian packages gets created.
  4. Build and release the new uap-java jar to archiva, and the new Debian package to apt (ask Andrew or Luca).
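
Rebasing a release branch (step 2) could look roughly like the following. This is a minimal sketch for the java fork; it assumes remotes named github (upstream) and origin (our fork), matching the naming used above:

    # Rebase the wmf branch on the updated upstream master
    git fetch --all
    git checkout wmf
    git rebase github/master
    # Point the uap-core submodule at current upstream master
    cd uap-core
    git checkout master
    git pull
    cd ..
    git add uap-core
    git commit -m "Bump uap-core to current master"
    # Publish the rebased branch to our fork
    git push --force-with-lease origin wmf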