User:AKhatun/Intro to WMF Search Data

Revision as of 21:10, 4 August 2022

Search Data

The Search Platform team at the Foundation temporarily stores some data from searches performed on various Wikimedia projects. Analyzing this data can help us understand which improvements would benefit users and how to create a better search experience for them. To do this, we first need to understand how search works and what data is stored. This page is intended to help you get started with search and search data, with resources, links, and brief explanations. It is not an exhaustive list or a complete explanation of everything related to search.

Where can you search from?

How search works

As soon as you start typing in any of the search boxes mentioned above, the search process has already started. Every letter or group of letters typed fires a search event; pressing Enter or clicking the magnifying glass icon fires an event; and clicking a result on the search results page fires another event. More about events later.

Here are some of the possibilities with searching:

  1. You start typing in the GO box or any other MediaWiki search bar. After each letter you type, you get a drop-down of title suggestions. These are called autocomplete searches. If you type multiple letters in quick succession, you may only get these suggestions when you pause.
    1. You can click one of the title suggestions and go to that page directly.
    2. Or, you can press Enter or select "search for pages containing <your text>". This takes you to the search results page.
  2. On the search results page, you will see your search results, results from other language wikis (if applicable), results from sister projects, and advanced search options. This is also the search special page. You can continue to perform other searches from here or read your results.
    1. Sometimes the word or phrase you searched for has no results. If the system thinks you meant something else, it will search for that and show those results instead. For example, if you search for azpw, the page is populated with results for aziz and says "Showing results for aziz. No results found for azpw."
    2. Sometimes the word or phrase you searched for has very few results. If the system thinks you meant something else, it will recommend a different (possibly correct) search. For example, if you search for alsha, it will say "Did you mean: alpha". It still shows the few results it found for alsha, but you can click on alpha and view those results instead.
  3. On the side are results from sister projects.
  4. At the bottom of the results are results from other language wikis, if applicable. Search for বন্য প্রানি (a non-English query) on the English Wikipedia, for example.
  5. Some wikis have results from Wikidata at the bottom as well.

Useful resources

A few blog posts. Find more on diff.wikimedia.org.

Data Sources

Sources of data related to Search

| Table name | Database | Description | Docs | Code |
|---|---|---|---|---|
| mediawiki_cirrussearch_request | event | Also known as query logs. Contains all search events, including the query, the hits returned from one or more wiki projects, time taken, and other backend information | Schema | - |
| searchsatisfaction | event | Table of various search events such as searchResultPage, click, checkin, etc., along with the query, number of hits returned, and other search-specific details | Schema | Source Code |
| query_clicks_hourly | discovery | A cross of mediawiki_cirrussearch_request and searchsatisfaction that lists each search query with its list of hits returned and clicks by the user | Schema | Source Code |
| query_clicks_daily | discovery | Sessionized version of the discovery.query_clicks_hourly table. Only contains queries with click-throughs | Schema | Source Code |
| search_satisfaction_daily | discovery | A sessionized daily version of the event.searchsatisfaction table. Each search session and most of its related information are aggregated in individual rows | - | Source Code |
| fulltext_head_queries | discovery | Aggregate of queries and their results after making some minor alterations to the query string (e.g. please and "please" both become please) | - | Source Code |

Table details

event.mediawiki_cirrussearch_request

event.searchsatisfaction

discovery.query_clicks_hourly

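The fields of this table are fairly clear from its schema definition. For each full-text search, it contains the list of all search results shown to the user and the list of all pages the user clicked. See the Schema (https://wikitech.wikimedia.org/wiki/Obsolete:Analytics/Data_Lake/Traffic/CirrusQueryClicks#discovery.query_clicks_hourly) and the Source Code (https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/dags/query_clicks.py#L87) for more details.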

discovery.query_clicks_daily

Sessionized version of the hourly table: it simply assigns a session_id to the queries. This table only contains full-text search sessions with click-throughs. If you want all search sessions, with or without click-throughs, use the hourly table instead.

discovery.search_satisfaction_daily

What is a search session?
"A search session identifies a single user performing searches within a limited timespan. If no search is performed within ten minutes of a previous search a new session id is generated." [1] So, whatever a user does after searching, such as clicking around, viewing pages, or viewing the next set of results, is given the same sessionID. A new session starts once the current session has been idle for 10 minutes.
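
The timeout rule can be illustrated with a minimal Python sketch. It is not the code that produces sessionID values in the actual instrumentation: the function name and the integer session indices are hypothetical, and it assumes every recorded event counts as activity and that a gap of more than ten minutes starts a new session.

```python
from datetime import datetime, timedelta

# Assumption: the ten-minute idle timeout from the quoted definition.
SESSION_TIMEOUT = timedelta(minutes=10)


def assign_session_ids(timestamps):
    """Return one session index per timestamp, in chronological order.

    A new session starts whenever the gap since the previous event
    is longer than SESSION_TIMEOUT.
    """
    session_ids = []
    current = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > SESSION_TIMEOUT:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


events = [
    datetime(2022, 8, 4, 21, 0),
    datetime(2022, 8, 4, 21, 4),   # 4 minutes later: same session
    datetime(2022, 8, 4, 21, 30),  # 26 minutes later: new session
    datetime(2022, 8, 4, 21, 35),  # 5 minutes later: same session
]
session_ids = assign_session_ids(events)
print(session_ids)
```

Here the first two events share one session and the last two share another, because only the 26-minute gap exceeds the timeout.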

discovery.search_satisfaction_daily is a sessionized daily version of event.searchsatisfaction. The event table records each event separately, whereas the daily table records searches session-wise, with a separate row for each full-text search (not autocomplete searches, only the searches users perform by pressing Enter or clicking the magnifying glass icon).

Additional explanation of some fields: Make sure to run describe table_name; in Hive, Spark SQL, or whichever tool you use to access the table, to see the field comments.

  • dym_shown: Whether the search engine result page (SERP) showed a Did You Mean (dym) suggestion. If a query returns too few results, the search engine tries to identify nearby words or phrases to search with and shows that query as a suggestion to the user. If the number of results is 0, the engine performs the search with the suggested query and shows those results instead. dym_shown is True in both situations.
  • is_autorewrite_dym: True when the search returned 0 results and the page shows results for the suggested query instead. This behavior is called autorewrite.
  • is_dym: When the user clicks the dym suggestion, a search is performed with the suggested query. The new result page has is_dym set to True, because this is the search for the dym-suggested query. It is also True for autorewrite queries, since the page is showing results for the suggested query.
  • dym_clicked: True when a user clicks the suggested query shown at the top of the page, i.e., the Did You Mean query.

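N.B.: In the case of an autorewrite, dym_shown, is_autorewrite_dym, and is_dym are all True. For more info about the logic, see the source code (lines 128-148): https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/spark/generate_daily_search_satisfaction.py#L128-L148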
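
To make the relationship between these flags concrete, here is a small Python sketch that derives them for a single result page from the descriptions above. It is only an illustration, not the logic in the linked source code: the function name, its parameters, and especially few_results_threshold (what counts as "too few" results) are assumptions made for this example. dym_clicked is left out because it depends on what the user does after the page is shown.

```python
def dym_flags(num_results, suggestion=None, from_dym_click=False,
              few_results_threshold=3):
    """Derive the dym-related flags for one search result page.

    num_results: results found for the query as the user typed it.
    suggestion: the Did You Mean query, or None if there is none.
    from_dym_click: True if the user reached this page by clicking
        a dym suggestion on a previous page.
    few_results_threshold: hypothetical cut-off for "too few" results.
    """
    has_suggestion = suggestion is not None
    # Zero results plus a suggestion: results for the suggestion are shown.
    autorewrite = has_suggestion and num_results == 0
    # Some, but few, results plus a suggestion: the suggestion is offered.
    few_results = has_suggestion and 0 < num_results < few_results_threshold
    return {
        "dym_shown": autorewrite or few_results,
        "is_autorewrite_dym": autorewrite,
        "is_dym": autorewrite or from_dym_click,
    }


# "azpw" has no results, so results for "aziz" are shown instead.
autorewrite_page = dym_flags(0, suggestion="aziz")
# "alsha" has a few results, so "alpha" is only suggested.
suggestion_page = dym_flags(2, suggestion="alpha")
# The page reached after the user clicks the "alpha" suggestion.
clicked_page = dym_flags(50, from_dym_click=True)
print(autorewrite_page, suggestion_page, clicked_page)
```

In this sketch the autorewrite page has all three flags set to True, which matches the note above, while the other two pages each set only one flag.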

discovery.fulltext_head_queries

This table is not much used at present. It contains:

  • norm_query : The normalized query. Only a few very basic normalizations are performed. See the Source Code docs (https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/spark/fulltext_head_queries.py) for more info on which normalizations are done. Queries are normalized and then grouped together based on the normalized version.
  • num_sessions : The number of sessions in which the queries grouped under this norm_query appeared.
  • queries : The original queries that were normalized to norm_query, along with the number of sessions each query was part of.
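
The normalize-then-group idea behind these three fields can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the Spark job linked above: the normalize function (lowercasing, dropping double quotes, collapsing whitespace) is a hypothetical stand-in for the real normalizations, and the input is assumed to be simple (session_id, query) pairs.

```python
from collections import defaultdict


def normalize(query):
    """Hypothetical normalization: lowercase, drop double quotes,
    and collapse runs of whitespace."""
    return " ".join(query.lower().replace('"', " ").split())


def head_queries(rows):
    """Group (session_id, query) pairs by normalized query.

    Returns a dict keyed by norm_query with the number of distinct
    sessions and, per original query, its own session count.
    """
    sessions_by_norm = defaultdict(set)
    sessions_by_query = defaultdict(lambda: defaultdict(set))
    for session_id, query in rows:
        norm = normalize(query)
        sessions_by_norm[norm].add(session_id)
        sessions_by_query[norm][query].add(session_id)
    return {
        norm: {
            "num_sessions": len(sessions),
            "queries": {q: len(s) for q, s in sessions_by_query[norm].items()},
        }
        for norm, sessions in sessions_by_norm.items()
    }


rows = [
    ("s1", "please"),
    ("s2", '"please"'),
    ("s2", "Please"),
    ("s3", "alpha"),
]
result = head_queries(rows)
print(result["please"])
```

Note that num_sessions counts distinct sessions, so it can be smaller than the sum of the per-query counts when one session contains several variants of the same query, as session s2 does here.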

References

  1. https://meta.wikimedia.org/wiki/Schema:SearchSatisfaction

Abbreviations

  • SERP: Search Engine Result Page
  • dym: Did You Mean (the alternate query suggestion that comes after a search that does not have enough results)