User:Joal/WDQS Traffic Analysis
Analysis of WDQS traffic on [[Wikidata query service#Hardware|public and internal clusters]]. The charts and data have been computed in [[Analytics/Systems/Jupyter|Jupyter notebooks]] running [[Analytics/Systems/Cluster/Spark|Spark]] on the [[Analytics/Systems/Cluster|Analytics hadoop cluster]]. The processed data consists of events sourced through the [[Event_Platform|Modern Event Platform]]. Originally written in March 2020; rerun with June 2020 data for the current version of the charts.


== Global traffic information ==


=== HTTP response codes ===
Most requests on internal and external traffic generate HTTP response code 200 (success). The rest of this analysis considers only requests that ended in a 200 response code, unless explicitly stated otherwise.

'''Note:''' The scales of the two charts are different - see the next section for a comparison of the number of requests across clusters.


{|
| style="padding-right: 30px" | [[File:Wdqs 2020 06 public http codes per day.png|thumb|600px|center|WDQS public cluster HTTP response codes for June 2020]] || [[File:Wdqs 2020 06 internal http codes per day.png|thumb|600px|center|WDQS internal cluster HTTP response codes for June 2020]]
|}
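The 200-only filter used throughout the rest of the analysis can be sketched in plain Python. This is an illustrative stand-in for the real Spark jobs: the field names <code>http_status</code> and <code>day</code> are hypothetical simplifications of the actual event schema.

```python
from collections import Counter

def count_200_per_day(requests):
    """Count successful (HTTP 200) requests per day.

    `requests` is an iterable of dicts; the `http_status` and `day`
    fields are illustrative stand-ins for the real event schema.
    """
    counts = Counter()
    for req in requests:
        if req["http_status"] == 200:  # keep only successful requests
            counts[req["day"]] += 1
    return counts

# Toy sample: two successes and one server error on the same day
sample = [
    {"http_status": 200, "day": "2020-06-01"},
    {"http_status": 500, "day": "2020-06-01"},
    {"http_status": 200, "day": "2020-06-01"},
]
```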


=== Public vs Internal ===
The number of 200 requests to the public cluster is about half of the number of requests to the internal one.


{|
|style="padding-right: 30px"| [[File:WDQS 2020 06 public and internal 200 totals.png|thumb|200px|center|WDQS public and internal clusters' requests for June 2020]] || || [[File:WDQS 2020 06 public internal 200 per day.png|thumb|600px|center|WDQS public and internal clusters' requests per day for June 2020]]
|}


=== Distinct queries ===
The number of daily distinct queries for the public cluster is about 61% of the total number of queries for June 2020, and 27% for the internal cluster. This means that each request is repeated on average 0.65 times per day for the public cluster, and 2.75 times per day for the internal one.


{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 repeated per day.png|thumb|600px|center|WDQS public cluster repeated requests per day for June 2020]] || [[File:WDQS 2020 06 internal 200 repeated per day.png|thumb|600px|center|WDQS internal cluster repeated requests for June 2020]]
|}
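The average repetition per query follows from the distinct-to-total ratio by simple arithmetic; a quick sanity check in Python, using the ratios reported in this analysis (61% public, 27% internal):

```python
def repeats_per_query(distinct_ratio):
    """Average number of extra runs per distinct query per day,
    given the ratio of distinct queries to total queries."""
    return 1 / distinct_ratio - 1

# Close to the figures quoted in the text (~0.65 and ~2.75)
public_repeats = repeats_per_query(0.61)    # ~0.64 extra runs per query
internal_repeats = repeats_per_query(0.27)  # ~2.70 extra runs per query
```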


== Query-time ==
One reason analyzing query-time is interesting is as a proxy for resource usage in the backend system: a long query presumably uses more computation resources than a fast one.

=== Public vs Internal ===
Despite serving about 2 times more requests than the public cluster, the internal cluster has a daily sum of query-time about 10 times smaller than the public cluster.


{|
| style="padding-right: 30px"| [[File:WDQS_2020_06_public_and_internal_200_query-time_totals.png|thumb|200px|center|WDQS public and internal clusters sum of query-time for June 2020]] || || [[File:WDQS_2020_06_public_and_internal_200_query-time_per_day.png|thumb|600px|center|WDQS public and internal clusters sum of query-time per day for June 2020]]
|}


=== Query-time classes ===
It is interesting to note that for the public cluster, the requests taking more than 10s represent a very small share of requests yet account for most of the processing time.


'''Note:''' These charts are generated using Google Docs, as the charting system used in the notebooks doesn't support dual-axis charts.


{|
| style="padding-right: 30px" | [[File:WDQS_2020_06_public_200_query-time_classes_and_requests.png|thumb|600px|center|WDQS public cluster sum of query-time and requests count per query-time class for June 2020]] || [[File:WDQS_2020_06_internal_200_query-time_classes_and_requests.png|thumb|600px|center|WDQS internal cluster sum of query-time and requests count per query-time class for June 2020]]
|}
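The query-time classes on these charts appear to be order-of-magnitude buckets. Assuming a floored log10 bucketing (the convention named explicitly for the in-flight sum-of-query-time classes later in this page), a class can be computed as:

```python
import math

def query_time_class(seconds):
    """Order-of-magnitude bucket for a query duration, assuming a
    floored-log10 convention:
    -2 -> [10ms, 100ms), 0 -> [1s, 10s), 1 -> [10s, 100s), ...
    """
    return math.floor(math.log10(seconds))
```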


=== Correlations (or not) ===
For the internal cluster, the sum of query-time is visually strongly correlated with the number of requests made (a known query class running at a consistent speed). For the public cluster, there is no such correlation, due to the variety of query classes (and implementations). Similarly, there is no visually noticeable correlation between query-time and request length (number of characters in the request), meaning that query length is not a good enough predictor of query complexity.
{|
| style="padding-right: 30px" | [[File:WDQS_2020_06_public_200_sum_query-time_per_requests_per_day.png|thumb|600px|center|WDQS public cluster sum of query-time per requests count for every day of June 2020]] || [[File:WDQS_2020_06_internal_200_sum_query-time_per_requests_per_day.png|thumb|600px|center|WDQS internal cluster sum of query-time per requests count for every day of June 2020]]
|-
| [[File:WDQS 2020 06 public 200 sum query-length per sum query-time per requests per day.png|thumb|600px|center|WDQS public cluster sum of query-length (point size) per sum of query-time per requests count for every day of June 2020]] ||
|}
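The correlation checks here were done visually with scatter plots; they could also be quantified with a Pearson coefficient. A minimal self-contained implementation, offered as a sketch rather than part of the original analysis:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series:
    covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A coefficient near 1 would match the internal cluster's tight query-time/request-count relationship; a value near 0 would match the public cluster's scatter.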
== User agents ==
In this section, '''log scales''' have been applied to some charts to make small values readable. Look at the scales!
=== Public vs Internal ===
For the public cluster, the number of daily distinct user-agents is quite variable, with most values between 20,000 and 30,000. User-agents querying the internal cluster are the expected ones, namely the WikibaseQualityConstraints tool for Wikidata and commonswiki, and two tools checking that the service is up.
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 distinct user-agents count per day.png|thumb|300px|center|WDQS public cluster distinct user-agents counts per day for June 2020]] || [[File:WDQS 2020 06 internal 200 requests per user-agent per day.png|thumb|600px|center|WDQS internal cluster requests per user-agent per day for June 2020]]
|}
=== Public cluster requests-count classes ===
On the public cluster, the number of user-agents making a single request per day is by far the highest, and as the number of daily requests grows, the number of distinct user-agents diminishes. One way to translate that into real-world usage is that a (relatively) small number of bots each make a lot of requests per day, while quite a few humans each make a small number of requests per day.
The second chart shows that most user-agents made requests on a single day of June, fewer user-agents made requests on two different days, and so on.
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 distinct user-agents count per request-count class per day.png|thumb|600px|center|WDQS public cluster distinct user-agents counts per request-count class per day for June 2020]] || [[File:WDQS 2020 06 public 200 distinct days of distinct user-agents count per request-count class.png|thumb|600px|center|WDQS public cluster distinct days of appearance of user-agents per request-count class for June 2020]]
|}
=== Public cluster request-count and max-query-time classes ===
On the public cluster, most max-query-times of more than 10s come from user-agents making a small number of daily queries (1 to 10), while user-agents making a large number of daily requests issue queries that mostly take less than 1s. No pattern emerges from looking at user-agents' max-query-time per day.
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 distinct user-agents count per query-time class and request-count class.png|thumb|600px|center|WDQS public cluster distinct user-agents counts per request-count class and query-time class for June 2020]] || [[File:WDQS 2020 06 public 200 distinct user-agents count per max-query-time class per day.png|thumb|600px|center|WDQS public cluster distinct user-agents count per query-time class per day of June 2020]]
|}
=== User-agents with 1 daily request ===
Having so many user-agents making a single request per day made us wonder whether they were more bot-ish or user-ish. To check, we used the fact that the raw user-agent string (the one used for the analysis above) is parsed into a map of predefined fields using the [https://github.com/ua-parser/ ua-parser] library.
In the next chart, ''undefined user-agents'' are user-agents for which no parsed field is set, meaning the user-agent is not a usually-parseable one (neither a regular user nor a regular bot).
{|
| [[File:WDQS_2020_06_public_200_undefined_user-agents_requests_count.png|thumb|600px|center|WDQS public cluster undefined user-agents request count per day for June 2020]]
|}
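The "undefined" test described above can be sketched as follows. The real pipeline parses raw strings with ua-parser; here the parsed result is mimicked as a plain dict of fields (a hypothetical stand-in for the actual parser output), and a user-agent counts as ''undefined'' when no field was recognized:

```python
def is_undefined(parsed_ua):
    """True when no parsed field is set, i.e. the raw user-agent string
    matched no known browser or bot pattern.
    `parsed_ua` is a dict standing in for the real ua-parser output."""
    return all(value is None for value in parsed_ua.values())

# A recognizable browser vs. an unparseable string
browser = {"os": "Windows XP", "browser": "IE 8", "device": "Other"}
mystery = {"os": None, "browser": None, "device": None}
```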
Finally, checking the non-undefined user-agents for June 9th (the big peak in the number of distinct user-agents) shows that a large number of distinct user-agents share the same parsed user-agent value:
{| class="wikitable"
! OS !! Browser !! Device !! Unique user agents
|-
| Windows XP || IE 8 || Other || 1163
|-
| Windows XP || IE 9 || Other || 1131
|-
| Windows XP || Chrome 18 || Other || 626
|-
| Windows NT || IE 9 || Other || 619
|-
| Windows XP || Chrome 31 || Other || 618
|-
| Windows XP || Chrome 21 || Other || 613
|}
== Queries Concurrency ==
For '''this section''', only '''public cluster''' data has been taken into consideration, and '''log scales''' are used on some charts. Also, responses with code <code>500</code> have been included in this section's queries to account for queries timing out.
=== In-flight queries ===
The first chart shows that for active hosts (wdqs1*, as opposed to wdqs2*) the number of in-flight requests peaks at 4 or 5, with a very flat long tail. The second chart shows that the sum of the query-time of in-flight requests has two modes: the smaller is ''between 10ms and 100ms'' and the larger ''between 100s and 1000s''. These two modes show that there are often at least one or two long-running queries being computed on the active hosts.
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 in-flight requests count per backend host.png|thumb|600px|center|WDQS public cluster in-flight queries count per backend host for June 2020]] || [[File:WDQS 2020 06 public 200 in-flight sum-query-time backend host.png|thumb|600px|center|WDQS public cluster in-flight sum of query-time per backend host for June 2020]]
|}
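In-flight counts of this kind can be recovered from request logs by a sweep over start/end events. A minimal sketch, assuming each log row carries a start time and a duration (the real computation ran in Spark; this is an illustrative pure-Python version):

```python
def max_in_flight(requests):
    """Maximum number of concurrently running requests.

    `requests` is a list of (start_time, duration) pairs; a request
    ending exactly when another starts is not counted as concurrent.
    """
    events = []
    for start, duration in requests:
        events.append((start, 1))              # request begins
        events.append((start + duration, -1))  # request ends
    # Sort by time, processing ends (-1) before starts (+1) on ties.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

The same sweep, with query durations accumulated instead of counted, would yield the in-flight sum-of-query-time shown in the second chart.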
The next two charts show that it seldom happens that a backend host takes a long time to process its next query, and that even when processing queries that take a long time, the number of in-flight requests doesn't grow much. This means that long-running requests don't block small requests from being processed, which is good!
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200 time to next query.png|thumb|600px|center|WDQS public cluster time-to-next query count for June 2020]] || [[File:WDQS 2020 06 public 200 in-flight sum-query-time-class per in-flight request-count for active hosts.png|thumb|600px|center|WDQS public cluster in-flight sum of query-time-class (floored log10 value) per in-flight request-count for active backend hosts for June 2020]]
|}
=== Same repeated query ===
The query <code>SELECT ?simbad ?item { ?item wdt:P3083 ?simbad VALUES ?simbad {}}</code> was repeated 150,794 times in June 2020, with an average query-time of 76ms. We use this query as a baseline and show that there is no noticeable correlation between query concurrency and the repeated query's processing time.
{|
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200-500 repeated-simbad rounded query-time count.png|thumb|300px|center|WDQS public cluster repeated ''simbad'' query rounded query-time count for June 2020]]
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200-500 repeated-simbad rounded query-time per in-flight requests count.png|thumb|300px|center|WDQS public cluster repeated ''simbad'' query rounded query-time per in-flight requests-count for June 2020]]
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200-500 repeated simbad rounded query-time per sum-query-time.png|thumb|300px|center|WDQS public cluster repeated ''simbad'' query rounded query-time per in-flight sum of query-time for June 2020]]
| style="padding-right: 30px" | [[File:WDQS 2020 06 public 200-500 repeated simbad rounded query-time per max-in-flight-query-time.png|thumb|300px|center|WDQS public cluster repeated ''simbad'' query rounded query-time per in-flight max of query-time for June 2020]]
|}

Latest revision as of 15:21, 21 July 2020
