Analytics/Systems/AQS

The Analytics Query Service (AQS) is a public-facing API that serves analytics data.

Hosted API

More up-to-date info at: https://wikimedia.org/api/rest_v1/?doc#/

Scaling: Settings, Failover and Capacity Projections

Monitoring

Grafana dashboards:


Throttling

2016-05-26

Summing up: throttling is enforced at the RESTBase/AQS layer, so requests that are served by Varnish are not throttled. This is an important point. It means that the throughput of the API on the top endpoints is very high, because the same data is requested over and over (on those endpoints we mostly serve "daily top" data). Throttling is done per (IP/endpoint/second), and a client that breaks the throttling limits will receive a 429 response code to its HTTP request.

At the time of this writing, throttling is set to trigger at a given rate of requests per (IP/endpoint/second), and thus far we are only logging when limits are breached; we are not enforcing throttling quite yet. Why? Because if we get more than 30 concurrent requests in Cassandra at any one time, Cassandra lookups time out. This will likely no longer be true after we finish our work on scaling the storage layer of the API.
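For reference, this is a minimal client-side sketch of what breaching the limit looks like (the article name is made up; the actual limits live in the RESTBase config linked below):

 # Requests answered from the Varnish cache are never throttled; a request that
 # reaches RESTBase/AQS while the client is over its per-(IP/endpoint/second)
 # limit gets a 429 instead of a 200.
 curl -s -o /dev/null -w "%{http_code}\n" \
   "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Foo/daily/2016010100/2016020200"
 # -> 200 normally, 429 once the client is over the limit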

Ticket in which we discussed throttling: [1]

Breached throttling limits are logged to: https://logstash.wikimedia.org/#/dashboard/temp/AVTsUtpi_LTxu7wlBfI-

Config for throttling is at: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml


2016-09-21

After the scaling work and load tests, throttling limits were bumped up to 400 requests per second. See: Analytics/AQS/Scaling#Load_testing

Deployment

This step-by-step guide covers deploying to both staging (beta) and production. Watch out for the specific differences between beta and prod called out in each step of this section.

Step 1: Update the aqs deploy repository

Note: Be aware that this process requires Docker to be installed, since Docker is instantiated during the build.

Note: Even if you're deploying to staging (beta), the code you want to deploy should be merged to master. Otherwise, the whole deployment process won't work.

  • If it's the first time you deploy:
    • Get the deploy repository: git clone ssh://$USER@gerrit.wikimedia.org:29418/analytics/aqs/deploy .
    • Make sure AQS source git repo has the deploy.dir config variable set (see Services/FirstDeployment#Local Git).
  • Run npm install in the source repository and make sure that no errors are returned. Do the same with npm test.
  • Are you deploying a new endpoint? You need to add a bit of code to the fake data script that matches the x-amples definition in AQS's v1 yaml. Otherwise endpoint checks will fail on deployment.
  • Then (regardless of whether it's the first time or not):
    • Follow Services/Deployment#Preparing_the_Deploy_Repository (basically, run ./server.js build --deploy-repo --force --review -c config.test.yaml in the source folder; see the consolidated example after this list).
    • Check that src's sha1 in the review corresponds to the code you want to deploy.
    • Merge the newly created change in the aqs deploy repo to master.
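Putting the steps above together, a typical Step 1 session looks roughly like this (all commands are taken from the list above; adjust paths to your own checkouts):

 # First time only: clone the deploy repository
 git clone ssh://$USER@gerrit.wikimedia.org:29418/analytics/aqs/deploy

 # In the AQS source repository: check that build and tests are clean
 npm install
 npm test

 # Build the deploy repository and push the change to Gerrit for review
 ./server.js build --deploy-repo --force --review -c config.test.yaml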

Issues with "src" path

Remove the src path from the deploy repo.

Issues with git review

The build uses git review only if you pass it the --review param; if you omit it, the patch will not be submitted: it will be committed but not pushed. Sometimes the build hangs. In this case, check the sync-repo branch of the deploy repository. It should have the commit in there, and that commit can be pushed to Gerrit. It's OK to kill the build if it's been hanging for a while.

NPM vulnerabilities

Whenever possible, it is convenient to run npm audit and make sure that no dependencies pose a threat to the service. Most vulnerabilities can be solved by upgrading packages, but in some cases they correspond to a second- or third-level dependency that can only be upgraded by forcing versions in package-lock.json. Forcing versions can be avoided if you are certain that the code carrying the vulnerability will not be run by AQS (task T207945 is an example of this). If that is not the case, you can enforce the new version by editing package-lock.json, making sure that the version change doesn't break the tests.

NPM has more information about dealing with vulnerabilities here.
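A minimal sketch of that workflow (the package name and version below are hypothetical):

 # List known vulnerabilities in the dependency tree
 npm audit

 # If the advisory is against a direct dependency, upgrading it is usually enough
 npm install some-vulnerable-package@1.2.3   # hypothetical package/version

 # If it is a second- or third-level dependency, force the version by editing
 # package-lock.json by hand, then make sure nothing breaks
 npm test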

Step 2: Deploy using scap

  • Tell the #wikimedia-analytics and #wikimedia-operations IRC channels that you are deploying (use !log for instance)
  • SSH into the deployment machine that suits your needs:
    • For staging (beta) use: deployment-tin.deployment-prep.eqiad.wmflabs.
    • For production use: deployment.eqiad.wmnet.
  • Execute scap:
    • cd /srv/deployment/analytics/aqs/deploy
    • git pull
    • git submodule update --init
    • scap deploy "YOUR DEPLOYMENT MESSAGE"
    • [optional] To see more detailed error logs during deployment, run scap deploy-log from /srv/deployment/analytics/aqs/deploy while you deploy.

Note: after T156049, scap will deploy only to aqs1004 (or deployment-aqs01 in the case of beta) as a first step (canary) and will ask for confirmation before proceeding to the rest of the cluster. After that, it will deploy to one host at a time, serially. You can tell scap to ask for confirmation after each host or not, but telling it to proceed to all the other hosts (after the canary) will not cause a deployment to all of them at the same time, since the previously mentioned constraint still holds. Each host is depooled from the load balancer before the aqs restart, and repooled after it.
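Putting the above together, a full production deployment session looks roughly like this (the deployment message is just an example):

 ssh deployment.eqiad.wmnet        # for beta: deployment-tin.deployment-prep.eqiad.wmflabs
 cd /srv/deployment/analytics/aqs/deploy
 git pull
 git submodule update --init
 scap deploy "Update aqs to latest"   # example message; confirm the canary when prompted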

Step 3: Test

Staging (beta)

Beta thus far just has a modest dataset: pageviews of the Barack Obama page in 2016 from es.wikipedia, en.wikipedia and de.wikipedia.

You can run some queries like the following to see that aqs is running well:

 wget http://localhost:7232/analytics.wikimedia.org/v1/pageviews/ 
 curl  http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/daily/2016010100/2016020200

Should return daily records

curl  http://localhost:7232/analytics.wikimedia.org/v1/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/monthly/2016010100/2016020200

Should return monthly records

curl  http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/en.wikipedia/all-access/all-agents/daily/2015100100/2016103100

Should return aggregate data for en.wikipedia, if any

curl  http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/es.wikipedia/all-access/all-agents/monthly/2015100100/2016103100

Should return monthly aggregate data for es.wikipedia

Production

From (one of) the deployed machine, run /srv/deployment/analytics/aqs/deploy/test/test_local_aqs_urls.sh.

Troubleshooting Deployment

Issues with deployment to labs (beta)

We had to SSH as the deploy-service user via the keyholder proxy:

SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-aqs01.deployment-prep.eqiad.wmflabs

Issues with scap

  • Depool machine
  • Delete deployment directory
  • Run puppet
  • Try to deploy again.

Check deploy logs:

scap deploy-log -v

Check AQS logs:

sudo journalctl -u aqs 

Journalctl might not have a lot of information, since by default RESTBase is configured to push logs to logstash. To disable this behavior, remove the gelf stream lines (marked with -) from the logging section of the AQS configuration file under /etc:

logging:
  name: aqs
  level: warn
  streams:
-  # XXX: Use gelf-stream -> logstash
-  - type: gelf
-    host: localhost
-    port: 12201

Manual AQS restart:

sudo systemctl restart aqs

Administration

Cassandra CLI

Cqlsh is a Python-based CLI for executing Cassandra Query Language commands. To start cqlsh in beta (the password is public, this is labs):

cqlsh deployment-aqs01.deployment-prep.eqiad.wmflabs -u cassandra -p cassandra

or

ssh deployment-aqs01.deployment-prep.eqiad.wmflabs
cqlsh -u cassandra -p cassandra deployment-aqs01

Load data into cassandra in beta

  1. Generate a CSV with the data you want to load. You have basically 2 options:
    • Generate it via a query to production cassandra:
      cqlsh -u user -p pwd aqs1004-a -e "select * from \"local_group_default_T_pageviews_per_article_flat\".data where article='Barack_Obama' and timestamp >'2016010100' and timestamp <'2017012000' and project in ('en.wikipedia', 'de.wikipedia' , 'jp.wikipedia', 'es.wikipedia')  and granularity='daily'  and \"_domain\"='analytics.wikimedia.org' " > out.csv
    • Generate it yourself. If you do this, take the following into account:
      • You have to include the underscore-prefixed columns of your table, i.e. _domain or _tid. Look at the table description to get them.
      • The _tid column needs to have valid timeuuid values. You can grab an existing _tid value from the data that is already loaded in another table. It's OK for testing purposes to give the same _tid value to all rows.
      • The column _del (that exists in all tables) must be left out. It should not be populated, otherwise the record will be interpreted as deleted.
      • Be careful with the values you insert. The COPY command checks for data types, but not for value correctness (as an insert statement would). So if you insert values that do not match the possible options for that column, your queries may not find the data. Example: If a column accepts a string among (a, b, c), and you give it d, the COPY command will not complain, but you'll not be able to find any data with your queries.
      • Clean out all NULL values; Cassandra is super picky about this.
      • The CSV should have no header.
  2. Move the data into beta (deployment-aqs01.deployment-prep.eqiad.wmflabs)
  3. Load the data into beta using the cqlsh COPY command (a sample CSV row is shown after this list):
    cqlsh -u cassandra -p cassandra deployment-aqs01
    COPY "your_keyspace_name".data ("_domain", "project", "access-site", "granularity", "timestamp", "_tid", "views") from '/home/your_user/cassandra_test_input.csv';
    Edit: This was all fine and dandy until we rolled out Cassandra 2.2. If the COPY command doesn't work, you can always insert data using insert statements:
    insert into "local_group_default_T_pageviews_per_article_flat".data  ("_domain", project, article, granularity, timestamp, "_tid", "_del", aa, ab, as, au, da, db, ds, du, maa, mab, mas, mau, mwa, mwb, mws, mwu) VALUES ('analytics.wikimedia.org','de.wikipedia','Barack_Obama','daily','2016010200', 13814000-1dd2-11b2-8080-808080808080,null,3527,null,28,3499,1398,null,22,1376,145,null,null,145,1984,null,6,1978);
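For reference, a CSV row for the COPY command above follows its column order (_domain, project, access-site, granularity, timestamp, _tid, views); the values below are made up purely for illustration:

 analytics.wikimedia.org,en.wikipedia,all-sites,daily,2016010100,13814000-1dd2-11b2-8080-808080808080,42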

Restbase status

On the host to check live requests:

elukey@aqs1003:~$ sudo httpry -i eth0 tcp

Check Restbase status:

elukey@aqs1003:~$ systemctl status aqs
● aqs.service - "aqs service"
   Loaded: loaded (/lib/systemd/system/aqs.service; enabled)
   Active: active (running) since Tue 2016-05-17 15:45:58 UTC; 1 day 21h ago
 Main PID: 25226 (firejail)
   CGroup: /system.slice/aqs.service
           ├─25226 /usr/bin/firejail --blacklist=root --blacklist=/home/* --tmpfs=/tmp --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
           ├─25227 /usr/bin/nodejs src/server.js -c /etc/aqs/config.yaml
           ├─25254 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
[...]
           ├─25493 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml
           └─25504 /usr/bin/nodejs /srv/deployment/analytics/aqs/deploy-cache/revs/a38e4d78718b072a70514477c3b268baaf8e1d29/src/server.js -c /etc/aqs/config.yaml

Cassandra status

Check Cassandra cluster status (UN == Up Normal):

elukey@aqs1001:~$ nodetool status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  10.64.48.117  1.5 TB     256     ?       eb6d11b8-7a7e-4820-b56b-23869d2b79da  rack1
UN  10.64.0.123   1.49 TB    256     ?       db9cd8c1-910a-49af-9605-38af3a064788  rack1
UN  10.64.32.175  1.49 TB    256     ?       434ae715-b2a9-459b-91e9-4f29764939fd  rack1

elukey@aqs1001:~$ nodetool info
ID                     : db9cd8c1-910a-49af-9605-38af3a064788
Gossip active          : true
Thrift active          : false
Native Transport active: true
Load                   : 1.49 TB
Generation No          : 1459781811
Uptime (seconds)       : 912
Heap Memory (MB)       : 8146.42 / 16384.00
Off Heap Memory (MB)   : 3492.32
Data Center            : eqiad
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 1977574, size 364.84 MB, capacity 400 MB, 4097 hits, 78494 requests, 0.052 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 200 MB, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token                  : (invoke with -T/--tokens to see all 256 tokens)

Please note: aqs100[456] are running two Cassandra instances per node, so you'll need to use nodetool-a or nodetool-b:

elukey@aqs1004:~$ nodetool-a status
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  10.64.48.148  495.72 KB  256     ?       da129795-421f-439b-bd29-6a4cd9f18813  rack1
UN  10.64.48.149  552.43 KB  256     ?       e28f73cd-93c6-47e6-b046-8bbf801389f6  rack1
UN  10.64.32.189  578.18 KB  256     ?       f05db2ca-61c4-4324-8f9a-d11d3cf66e95  rack1
UN  10.64.0.126   624.22 KB  256     ?       af353a9f-0dd4-41f1-8a08-b1c7e57b2c68  rack1
UN  10.64.32.190  524.25 KB  256     ?       571af44e-23c3-4140-b59c-66fbdc16af6a  rack1
UN  10.64.0.127   610.39 KB  256     ?       06dc704b-b39b-4d2a-8d9e-81368163221f  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Cassandra logs

The most useful one is /var/log/cassandra/system.log, which becomes system-a.log and system-b.log on aqs100[456] since we have two Cassandra instances running:

elukey@aqs1004:/var/log/cassandra$ ls
gc-a.log.0.current  gc-b.log.0.current	system-a.log  system-b.log  system.log

Network Configuration

The AQS IPs are deployed in the Production network, while the Hadoop IPs are in the Analytics network. The traffic flow is guarded by ACLs on switches/routers that need to be updated if you want to connect new AQS IPs to the Analytics network. For example, this is the error that we were getting from analytics1* hosts while trying to upload data to the aqs1004-a.eqiad.wmnet Cassandra instance:

Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: 
All host(s) tried for query failed 
(tried: aqs1004-a.eqiad.wmnet/10.64.0.126:9042 (com.datastax.driver.core.TransportException: 
[aqs1004-a.eqiad.wmnet/10.64.0.126:9042] Cannot connect))

To solve the issue ops extended the existing ACL for aqs100[123].eqiad.wmnet to allow all the Cassandra Instances IPs too.
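A quick way to check whether the ACLs allow a given host to reach a Cassandra instance is to probe the native transport port (9042). This is a generic connectivity check, not a documented procedure:

 # From an analytics1* host: succeeds only if the ACLs allow the connection
 nc -zv aqs1004-a.eqiad.wmnet 9042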


Deploy new History snapshot for Wikistats Backend

As of Q4 2018, every snapshot of mediawiki history we load into Druid is a new datasource named after the snapshot, for example "mediawiki-2018-05". AQS will not serve this data until told to do so (this is so we can easily roll back to a prior snapshot). In order to enable a new snapshot you need to change the hiera config for AQS that points to the active snapshot. See this patch for an example: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440148
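The change itself is a small hiera edit pointing AQS at the new datasource. A sketch of what it looks like follows; the key name here is hypothetical, see the linked patch for the real one:

 # hieradata for the AQS role (hypothetical key name)
 profile::aqs::druid_datasources:
   mediawiki_history: mediawiki-2018-05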

Useful commands

Password

See: /etc/aqs/config.yaml

See table schema:

cassandra@cqlsh> describe table "local_group_default_T_pageviews_per_article_flat".data

Add fake data to Cassandra after wiping the cluster

cqlsh -u cassandra -p cassandra aqs1004-a \
    -f /srv/deployment/analytics/aqs/deploy/scripts/insert_monitoring_fake_data.cql

This command ensures that no AQS-related alarm will fire.

Old procedure needed only for aqs100[123]

aqs100[123]* nodes have 12 disks. The layout is:

  • / /dev/md0 RAID 0 on sda1 and sdb1
  • swap /dev/md1 RAID 0 on sda2 and sdb2
  • /dev/md2 is LVM on RAID 10 across sda3, sdb3 and sdc1 - sdl1

partman is not smart enough to make this layout, so it has to be done manually. Assuming the raid1-30G.cfg recipe was used to install these hosts, run the following to create the desired partition layout:

#!/bin/bash

# Delete partition 3 if you have it left over from a previous installation.
for disk in /dev/sd{a,b}; do
fdisk $disk <<EOF
d
3
w
EOF
done

# Delete DataNode partitions if leftover from previous installation.
for disk in /dev/sd{c,d,e,f,g,h,i,j,k,l}; do
fdisk $disk <<EOF
d
1
w
EOF
done

# Create RAID partition 3 on sda and sdb
for disk in /dev/sd{a,b}; do
fdisk $disk <<EOF
n
p
3


t
3
fd
w
EOF
done


# Create RAID on a single partition spanning full disk for remaining 10 disks.
for disk in /dev/sd{c,d,e,f,g,h,i,j,k,l}; do
fdisk $disk <<EOF
n
p
1


t
fd
w
EOF
done

# run partprobe to refresh partition table
# (apt-get install parted)
partprobe

# Create mirrored RAID 10 on sda3, sdb3, and sdc1-sdl1
md_name=/dev/md/2
mdadm --create ${md_name} --level 10 --raid-devices=12 /dev/sda3 /dev/sdb3 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 <<EOF
y
EOF
/usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf

# set up LVM on /dev/md2 for cassandra
pvcreate /dev/md2
vgcreate "${HOSTNAME}-vg" /dev/md2
lvcreate -L 10T "${HOSTNAME}-vg" -n cassandra

# Make an ext4 filesystem on the new cassandra partition
mkfs.ext4 /dev/"${HOSTNAME}-vg"/cassandra
tune2fs -m 0 /dev/"${HOSTNAME}-vg"/cassandra

cassandra_directory=/var/lib/cassandra
mkdir -pv $cassandra_directory

# Add the LV to fstab
grep -q $cassandra_directory /etc/fstab || echo -e "# Cassandra Data Partition\n/dev/${HOSTNAME}-vg/cassandra\t${cassandra_directory}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab
mount $cassandra_directory