Analytics/Systems/AQS/Scaling/LoadTesting
Cassandra Info
See: https://etherpad.wikimedia.org/p/analytics-aqs-cassandra
TL;DR
Our new compaction strategy greatly reduces the number of SSTables read on every request, and that, together with SSDs, has a strong impact on the throughput the cluster can sustain. We are able to operate at ~400 transactions per second (TPS) per machine. Whether we have 3-4 months of data loaded or 10-12, throughput per host doesn't change.
We did not investigate where the bottleneck is, so it might be that our woes have moved to the HTTP layer.
While tests are going on, CPU usage doesn't go beyond 20%. The machine is not taking production traffic, so CPU usage is only due to compaction, Cassandra itself, and the load testing.
Tests setup
We use siege to load test, from inside the firewall, the RESTBase instances deployed on aqs100[4,5,6].
Siege doesn't seem to be able to load a file with more than 100,000 URLs, so every test uses a file of that size.
- The source files to test with have about 11 million URLs, so we break them into smaller files with split (see the sketch below this list).
- Cassandra is writing debug logs while these tests are going on, so it is likely we could run with a more efficient configuration.
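A minimal sketch of that setup; file names here are illustrative (the actual files live in nuria@aqs1004:~/load-testing):

# Split the ~11M-url source file into 100,000-line chunks, since
# siege doesn't seem to load files larger than that.
split -l 100000 -d part-00000 part-00000_chunk_

# Siege one chunk: -c users times -r repetitions gives the total
# number of transactions (here 200 x 1000 = 200,000), logged via --log.
siege -c 200 -r 1000 --file=part-00000_chunk_00 --log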
Graphs
CPU load in aqs1004: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=bytes_out&c=Analytics+Query+Service+eqiad&h=aqs1004.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=
SSTables read per query, new cluster: see graph.
Load Tests
Data and tests are at: nuria@aqs1004:~/load-testing
TL;DR
- From what I can see, compaction doesn't affect throughput, and we are capped at about 450 transactions per second per host. This number doesn't change with the amount of data loaded thus far; that is, the throughput cap is the same with 4 and 7 months of data.
- SSTables per read is about 3; on our old cluster this number is 15. This means that the compaction we are using is more efficient for our type of reads (a sketch of how to read this number follows this list).
- CPU doesn't even get to 20% while testing at a throughput of > 300 TPS for minutes at a time.
- We have debug logging on; if logging is in any way a bottleneck, our throughput might be even better.
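The SSTables-per-read figure comes from Cassandra's per-table read histograms; a minimal sketch of how to check it, assuming placeholder keyspace/table names rather than the actual AQS schema names (on this era of Cassandra the subcommand is cfhistograms; newer releases rename it to tablehistograms):

# The "SSTables" column shows how many SSTables were touched per read.
# Replace my_keyspace/my_table with the real AQS keyspace and table.
nodetool cfhistograms my_keyspace my_table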
Tests 2016-08-16. Three months loaded. Compaction has finished
Results
- We test on aqs1004 with siege once compaction has finished, before starting the next load of data.
The maximum throughput one box is able to sustain is about 350 transactions per second (TPS). Not much of a blip on CPU; it doesn't go above 20%.
Details
Tests 2016-08-16, file with 1e6 URLs
"a hundred concurrent users doing 100 repetitions", should be 1e6 transactions -c100 -r100 ** SIEGE 3.0.8 ** Preparing 100 concurrent users for battle. The server is now under siege.. done. Transactions: 10000 hits Availability: 100.00 % Elapsed time: 62.82 secs Data transferred: 2.95 MB Response time: 0.01 secs Transaction rate: 159.18 trans/sec Throughput: 0.05 MB/sec Concurrency: 2.30 Successful transactions: 9135 Failed transactions: 0 Longest transaction: 0.09 Shortest transaction: 0.00 *******Longer test: -c100 -r1000 ** SIEGE 3.0.8 ** Preparing 100 concurrent users for battle. The server is now under siege.. done. Transactions: 100000 hits Availability: 100.00 % Elapsed time: 548.67 secs Data transferred: 28.91 MB Response time: 0.01 secs Transaction rate: 182.26 trans/sec Throughput: 0.05 MB/sec Concurrency: 2.55 Successful transactions: 92589 Failed transactions: 0 Longest transaction: 0.22 Shortest transaction: 0.00 ********* bumping up concurrency Still cpu not even 20% nuria@aqs1004:~/load-testing$ siege -c 200 -r 1000 --file=part-00000_1_1e6.txt --log ** SIEGE 3.0.8 ** Preparing 200 concurrent users for battle. The server is now under siege.. done. Transactions: 200000 hits Availability: 100.00 % Elapsed time: 551.68 secs Data transferred: 57.82 MB Response time: 0.01 secs Transaction rate: 362.53 trans/sec Throughput: 0.10 MB/sec Concurrency: 5.04 Successful transactions: 185178 Failed transactions: 0 Longest transaction: 0.22 Shortest transaction: 0.00 ******* siege -c 400 -r 1000 --file=part-00000_1_1e6.txt --log Not working. ******** siege -c 300 -r 1000 --file=part-00000_1_1e6.txt --log Not working either!!! Looks like 300 TPS is our threshold.
Tests 2016-08-18. Four months loaded. Compaction is happening while testing
Results
Compaction doesn't seem to affect throughput. Average response time is slower, twice as slow actually (is siege's "response time" a percentile or an average?).
The number of SSTables read is still around 3, and CPU stays low (<20%) while tests run at a throughput of about 300 TPS.
Details
200 concurrent users doing 1000 repetitions:

siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
Transactions: 199870 hits
Availability: 99.94 %
Elapsed time: 563.90 secs
Data transferred: 57.81 MB
Response time: 0.02 secs
Transaction rate: 354.44 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 7.01
Successful transactions: 185055
Failed transactions: 130
Longest transaction: 5.23
Shortest transaction: 0.00

It took about the same time to complete this test as when compaction was not happening; average response time is higher, 0.02 versus 0.01 secs.

2nd run, same set of URLs:

nuria@aqs1004:~/load-testing/2016-08-18$ siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 559.88 secs
Data transferred: 57.82 MB
Response time: 0.02 secs
Transaction rate: 357.22 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.84
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.33
Shortest transaction: 0.00

Average response time is the same and so is TPS.

3rd run, different set of URLs; results are about identical to the 1st run. The slowest transaction is slower, but averages are about the same:

nuria@aqs1004:~/load-testing/2016-08-18$ siege -c 200 -r 1000 --file=../part-00000_2_1e6.txt --log
Transactions: 199999 hits
Availability: 100.00 %
Elapsed time: 560.76 secs
Data transferred: 58.84 MB
Response time: 0.02 secs
Transaction rate: 356.66 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 7.32
Successful transactions: 186271
Failed transactions: 1
Longest transaction: 5.01
Shortest transaction: 0.00

Same values for response time and throughput.
Tests 2016-08-26. Seven months loaded. Compaction is happening while testing
Results
Compaction doesn't seem to affect throughput (consistent with our first test on 2016-08-16). We can bump throughput up to 450 TPS.
Reusing the same set of URLs does not increase our throughput; it just seems that the slowest transaction is not as slow. This might be a red herring if those numbers are affected by outliers.
CPU usage doesn't get to 20% while tests are going on (keep in mind that compaction is happening). SSTables per read stays at 3.
Details
1st run:

siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 566.03 secs
Data transferred: 57.82 MB
Response time: 0.02 secs
Transaction rate: 353.34 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.41
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.27
Shortest transaction: 0.00

2nd run, same set of URLs; results are about the same, less of a blip on CPU:

nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 558.60 secs
Data transferred: 57.82 MB
Response time: 0.01 secs
Transaction rate: 358.04 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.09
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.26
Shortest transaction: 0.00

3rd run, different set of URLs; pretty much the same results as round 1:

siege -c 200 -r 1000 --file=../part-00000_2_1e6.txt --log
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 561.96 secs
Data transferred: 58.84 MB
Response time: 0.01 secs
Transaction rate: 355.90 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.23
Successful transactions: 186272
Failed transactions: 0
Longest transaction: 0.21
Shortest transaction: 0.00

4th run, trying to bump up concurrency:

siege -c 300 -r 1000 --file=../part-00000_2_1e6.txt --log
Test breaks; transactions per second are too much.

5th run, bumping concurrency up again but a bit less; transaction rate goes above 400 per second:

nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 250 -r 1000 --file=../part-00000_2_1e6.txt --log
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 557.41 secs
Data transferred: 73.59 MB
Response time: 0.01 secs
Transaction rate: 448.50 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 5.98
Successful transactions: 232779
Failed transactions: 0
Longest transaction: 0.21
Shortest transaction: 0.00

6th run, using a fresh set of URLs:

nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 250 -r 1000 --file=../part-00000_3_1e6.txt --log
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 560.55 secs
Data transferred: 73.35 MB
Response time: 0.02 secs
Transaction rate: 445.99 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 7.04
Successful transactions: 233363
Failed transactions: 0
Longest transaction: 0.24
Shortest transaction: 0.00
Tests 2016-09-17. All data loaded, fix for nulls-to-zeros deployed. No compaction happening
Results
The throughput limit is unchanged regardless of data size. It is likely that we are now running into a limit in our HTTP layer rather than the storage layer.
CPU usage doesn't get to 20% while tests are going on. SSTables per read is at 3.
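We did not isolate the layers here. One possible follow-up, sketched against the stock cassandra-stress tool that ships with Cassandra (parameters illustrative, not a test we ran):

# Read-heavy stress straight against Cassandra, bypassing RESTBase.
# If storage alone sustains far more than ~450 reads/sec, the cap
# above most likely sits in the HTTP layer.
cassandra-stress read n=1000000 -node aqs1004.eqiad.wmnet -rate threads=200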
Details
Test 1: 1e6 distinct URLs, 200 users, 1000 repetitions, which means URLs are hit more than once:

siege -c 200 -r 1000 --file=../part-00000
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 556.29 secs
Data transferred: 58.90 MB
Response time: 0.02 secs
Transaction rate: 359.52 trans/sec
Throughput: 0.11 MB/sec
Concurrency: 6.78
Successful transactions: 186310
Failed transactions: 0
Longest transaction: 0.50
Shortest transaction: 0.00

Test 2: same setup but with 200,000 distinct URLs. Results are about identical:

nuria@aqs1004:~/load-testing/2016-09-17$ siege -c 200 -r 1000 --file=./urls_2e6.txt.2
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 561.82 secs
Data transferred: 58.04 MB
Response time: 0.02 secs
Transaction rate: 355.99 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 6.71
Successful transactions: 185881
Failed transactions: 0
Longest transaction: 0.26
Shortest transaction: 0.00

Test 3: bumping up transactions per second:

siege -c 300 -r 1000 --file=./urls_2e6.txt.2
This results in transaction rates > 500 per second and things break.

Test 4: reduced concurrent users to 250; we actually get to > 400 throughput:

siege -c 250 -r 1000 --file=./urls_2e6.txt.2
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 562.85 secs
Data transferred: 72.71 MB
Response time: 0.02 secs
Transaction rate: 444.17 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 7.84
Successful transactions: 232861
Failed transactions: 0
Longest transaction: 0.29
Shortest transaction: 0.00

Test 5: same users and repetitions as Test 4 but with URLs that expand to a larger time range. Results are similar to the prior test:

nuria@aqs1004:~/load-testing/2016-09-17$ siege -c 250 -r 1000 --file=./urls_2e6.txt.2.late_end_date
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 556.65 secs
Data transferred: 83.56 MB
Response time: 0.02 secs
Transaction rate: 449.12 trans/sec
Throughput: 0.15 MB/sec
Concurrency: 8.72
Successful transactions: 232609
Failed transactions: 0
Longest transaction: 0.29
Shortest transaction: 0.00
Calibration. Performance Tests on Old Cluster
We have used siege to run tests on the old cluster in order to get a baseline: how siege results compare to what we see in the live system. We know from experience that our old cluster couldn't sustain more than 30 requests per second, due to hardware (lack of SSDs) but also to software (suboptimal Cassandra compaction).
We tested the old cluster with data up to 2016-09-23 but without it receiving production traffic, so our setup is identical to the performance tests we ran on the newer cluster.
TL;DR
Latencies are way higher at much lower concurrency levels. Response times are approximately 2 seconds at fewer than 60 transactions per second. This means that latencies (as measured by our tests) are about 100 times higher on the old cluster at similar (and lower) concurrency levels.
The cluster cannot even get to 100 transactions per second; in theory the load tests tell us it could support almost 60. CPU gets to 20%. In reality we know our threshold is around 30, so we can safely conclude that siege gives us the right order of magnitude; the actual threshold has to be found experimentally.
Details
nuria@aqs1001:~/load-testing$ siege -c 120 -r 1000 --file=urls_2e6.txt.2.late_end_date
Transactions: 119593 hits
Availability: 99.66 %
Elapsed time: 2278.04 secs
Data transferred: 40.07 MB
Response time: 1.59 secs
Transaction rate: 52.50 trans/sec <-
Throughput: 0.02 MB/sec
Concurrency: 83.61
Successful transactions: 111310
Failed transactions: 407
Longest transaction: 5.48
Shortest transaction: 0.00

nuria@aqs1001:~/load-testing$ siege -c 150 -r 1000 --file=urls_2e6.txt.2.late_end_date
siege aborted due to excessive socket failure; you can change the failure threshold in $HOME/.siegerc
Transactions: 42589 hits
Availability: 97.63 %
Elapsed time: 723.67 secs
Data transferred: 14.13 MB
Response time: 2.03 secs
Transaction rate: 58.85 trans/sec <-
Throughput: 0.02 MB/sec
Concurrency: 119.26
Successful transactions: 39333
Failed transactions: 1035
Longest transaction: 5.89
Shortest transaction: 0.00
Trying to bump up concurrency beyond this level runs into errors.
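As a sanity check on siege's numbers: its reported Concurrency is consistent with Little's law (concurrency ≈ transaction rate × average response time) in both runs above.

# 52.50 trans/sec * 1.59 s ≈ 83.5  (siege reported 83.61)
# 58.85 trans/sec * 2.03 s ≈ 119.5 (siege reported 119.26)
awk 'BEGIN { print 52.50 * 1.59, 58.85 * 2.03 }'

The small deviations come from rounding in the printed rate and response time.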