
Analytics/Data/Cirrus: Difference between revisions

imported>Bearloga
(Added current schema for easier reference)
imported>Quiddity
 
== Idealized schema ==
#REDIRECT [[Obsolete:Analytics/Data Lake/Traffic/Cirrus]]
 
This page documents an idealized schema for the Cirrus search requests table.
 
<pre>
dt                  string    Timestamp at cache in ISO 8601, e.g. "2015-07-25 07:53:52". First field in the existing tarballs.
hostname            string    Source node hostname, e.g. "mw1168". Second field in the existing tarballs.
source              string    The wiki it came from, e.g. "enwiki". Third field in the existing tarballs.
target_index        string    The target index, e.g. "enwiki_content". In some rare cases multiple indexes can be requested; can we have an array of strings here?
ip                  string    IP of packet at cache. This will need to be extracted and passed through.
x_forwarded_for     string    The x_forwarded_for field. Will need to be extracted and passed through.
search_query        string    The actual search query.
user_agent          string    The user agent of the request.
search_type         string    The type of search request: "full text", "prefix" or NULL. We probably don't want the maintenance tasks in here, do we?
total_time          int       Total time taken.
es_time             int       Elasticsearch time taken.
total_results       int       Total results found.
returned_results    int       Number of results returned.
result_index        int       Index of returned results.
search_suggestion   string    The search suggestion provided; NULL if none.
executor_id         int       A temporary unique ID identifying the executor, allowing us to group chains of queries as a single success or failure.
is_api              boolean   A flag identifying whether the request was from the API (true) or web (false).
year                int       Unpadded year of request
month               int       Unpadded month of request
day                 int       Unpadded day of request
hour                int       Unpadded hour of request
 
# Partition Information
# col_name          data_type comment
is_api              boolean   A flag identifying whether the request was from the API (true) or web (false).
year                int       Unpadded year of request
month               int       Unpadded month of request
day                 int       Unpadded day of request
hour                int       Unpadded hour of request
</pre>
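
The idealized schema partitions rows by unpadded year, month, day and hour, all of which can be derived from <code>dt</code>. The following is a minimal Python sketch of that derivation, assuming <code>dt</code> always has the "YYYY-MM-DD HH:MM:SS" form shown in the example above. The function name <code>partition_values</code> is hypothetical and not part of any existing tooling.

```python
from datetime import datetime


def partition_values(dt: str) -> dict:
    """Derive unpadded year/month/day/hour partition values from a `dt` string.

    Assumes the "YYYY-MM-DD HH:MM:SS" layout used in the schema example;
    other ISO 8601 variants (e.g. with a "T" separator) are not handled.
    """
    parsed = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    # Python ints carry no zero padding, which matches the "unpadded"
    # wording in the partition column comments.
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
    }


if __name__ == "__main__":
    print(partition_values("2015-07-25 07:53:52"))
    # {'year': 2015, 'month': 7, 'day': 25, 'hour': 7}
```
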
 
== Current schema ==
 
<pre>
ts                  int                 Timestamp at cache; stored as an int, so presumably a Unix epoch value rather than an ISO 8601 string
wikiid              string              Wiki ID, e.g. "enwiki"
source              string              The wiki it came from, e.g. "enwiki"
identity            string              MD5(UA + XFF + Optional String)
ip                  string              IP of packet at cache
useragent           string              The user agent of the request.
backendusertests    array<string>       Lists A/B tests the user is enrolled in.
payload             map<string,string>
requests            array<
                      struct<
                        query:string,
                        querytype:string,
                        indices:array<string>,
                        tookms:int,
                        elastictookms:int,
                        limit:int,
                        hitstotal:int,
                        hitsreturned:int,
                        hitsoffset:int,
                        namespaces:array<int>,
                        suggestion:string,
                        suggestionrequested:boolean,
                        payload:map<string,string>
                      > # /struct
                    > # /array
year                string              Unpadded year of request
month               string              Unpadded month of request
day                 string              Unpadded day of request
hour                string              Unpadded hour of request

# Partition Information
# col_name          data_type           comment
year                string              Unpadded year of request
month               string              Unpadded month of request
day                 string              Unpadded day of request
hour                string              Unpadded hour of request
</pre>
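
In the current schema each row carries an array of request structs, so per-request analysis usually starts by flattening that array. The sketch below shows one way to do this in Python, assuming a row has already been loaded as a plain dict whose keys match the column names above. The function name <code>flatten_requests</code>, the added <code>request_index</code> key and the sample values are all illustrative and do not come from real data.

```python
def flatten_requests(row: dict) -> list:
    """Return one flat dict per entry in the row's `requests` array.

    Row-level fields are copied onto every output record so each request
    can be analysed on its own. `request_index` is a hypothetical helper
    key recording the position of the request within the array.
    """
    shared = {key: row.get(key) for key in ("ts", "wikiid", "source", "identity")}
    flat = []
    # `requests` may be missing or NULL; treat both as an empty array.
    for position, request in enumerate(row.get("requests") or []):
        record = dict(shared)
        record["request_index"] = position
        for field in ("query", "querytype", "indices", "tookms",
                      "elastictookms", "hitstotal", "hitsreturned",
                      "hitsoffset", "suggestion"):
            record[field] = request.get(field)
        flat.append(record)
    return flat


if __name__ == "__main__":
    # Made-up row for illustration only.
    sample_row = {
        "ts": 1437810832,
        "wikiid": "enwiki",
        "source": "enwiki",
        "identity": "0123456789abcdef0123456789abcdef",
        "requests": [
            {"query": "kittens", "querytype": "prefix",
             "indices": ["enwiki_content"], "hitstotal": 12, "hitsreturned": 10},
            {"query": "kittens", "querytype": "full text",
             "indices": ["enwiki_content"], "hitstotal": 340, "hitsreturned": 20},
        ],
    }
    for record in flatten_requests(sample_row):
        print(record["request_index"], record["querytype"], record["hitstotal"])
```

In Hive itself the same reshaping is normally done with <code>LATERAL VIEW explode(requests)</code>; the Python version is only meant to make the nesting explicit.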

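The <code>identity</code> column is described only as MD5(UA + XFF + Optional String). The sketch below shows what such a hash could look like in Python. Several details are assumptions, since the schema does not state them: plain concatenation with no separator, UTF-8 encoding, and a lowercase hexadecimal digest. The function name <code>identity_hash</code> and the parameter <code>extra</code> are hypothetical.

```python
import hashlib


def identity_hash(user_agent: str, x_forwarded_for: str, extra: str = "") -> str:
    """Hash user agent + X-Forwarded-For + optional string with MD5.

    Assumptions (not confirmed by the schema): the parts are joined with
    no separator, encoded as UTF-8, and the digest is returned as hex.
    """
    raw = (user_agent + x_forwarded_for + extra).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


if __name__ == "__main__":
    # Illustrative inputs only.
    print(identity_hash("Mozilla/5.0", "203.0.113.7"))
```

Because the parts are concatenated without a separator under these assumptions, different input splits can produce the same hash, so the value groups requests rather than strictly identifying a client.
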
Latest revision as of 20:27, 25 October 2019