You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Schema change 2004-08 searchindex

From Wikitech-static
Jump to navigation Jump to search

This is the description of an improved table structure and query for full text search.

Full text search

/* Jamesday added si_namespace and si_is_redirect and key si_page_is_redirect_namespace */

CREATE TABLE jamesday_searchindex (
  si_page int(8) unsigned NOT NULL default '0',
  si_namespace tinyint(2) unsigned NOT NULL default '0',
  si_is_redirect tinyint(1) unsigned NOT NULL default '0',
  si_title varchar(255) NOT NULL default '',
  si_text mediumtext NOT NULL,
  UNIQUE KEY si_page (si_page),
/* not required, used in tests  KEY si_is_redirect_namespace (si_is_redirect, si_namespace), */
  UNIQUE KEY si_page_is_redirect_namespace (si_page, si_is_redirect, si_namespace),
  FULLTEXT KEY si_title (si_title), /* not required if different title-only table is used */
  FULLTEXT KEY si_text (si_text)
) TYPE=MyISAM PACK_KEYS=1;

For tests the data was loaded with:

insert into jamesday_searchindex (si_page, si_namespace, si_is_redirect, si_title, si_text)
select si_page, cur_namespace, cur_is_redirect, si_title, si_text
from searchindex, cur
where si_page=cur_id limit 200;

This query was used:

/* using modified searchindex with self join to make an index provide the where values */
SELECT si2.si_page,si2.si_namespace
FROM jamesday_searchindex as si2 use index (si_page_is_redirect_namespace),
jamesday_searchindex as si
WHERE si2.si_page=si.si_page
AND  (MATCH(si.si_text) AGAINST('+who +is +the +republican +canidate +for +president +of +the +united +states' IN BOOLEAN MODE)
AND si2.si_is_redirect=0)
AND si2.si_namespace IN (0,9,11)
LIMIT 0, 20;

It should be followed with a search to find the cur_text values. Adding cur in the query added about 15 seconds to a 40 second time, using this query to test it:

/* add cur to produce the final result */
SELECT si2.si_page,si2.si_namespace, cur_title
FROM jamesday_searchindex as si2 use index (si_page_is_redirect_namespace),
jamesday_searchindex as si,
cur
WHERE si2.si_page=si.si_page
AND  (MATCH(si.si_text) AGAINST('+who +is +the +republican +canidate +for +president +of +the +united +states' IN BOOLEAN MODE)
AND si2.si_is_redirect=0)
AND si2.si_namespace IN (0,9,11) 
AND cur_id=si2.si_page
LIMIT 0, 20;

The time penalty seems to make this version less desirable than making two queries, the first to get the IDs, the second the text.

It's probably helpful to get 40 results and only load the textof the first 20 initially. Then have the next page button on the search form send back the IDs of the next 20 records and retrive the body text for them. The initial search is costly but it doesn't take much longer to get a few more record IDs at the same time. May be worth getting more than one page more - no idea how often more than one page of results is needed.

These versions avoid loading large parts of cur or searchindex into cache to test the where terms, by using indexes which contain them all, using the self join to get the ability to use two different indexes on searchindex. Using searchindex like this was faster than adding the indexes directly to cur. Because a search server needs to have lots of cache for MyISAM, this should work better still for a search server, which won't need to allocate much InnoDB cache.

Title only search

If a title only search is intended to be a low load alternative to full text search, it's better to use a different table which doesn't have the si_text field. That table will use much less cache memory because it won't need to store the full text of the article. That should scale very well.

General

Making sure that all parts of a where can be found from indexes is why these are so much faster than the current version, which has to read large parts of cur to carry out the where tests. MySQL can only use one index per query, so this self join trick is required to use more. Doing a self join just to get the ability to use a second index may be worthwhile in other situations as well.