Channels ▼
RSS

Testing

Full-Text Search with InnoDB


Create an FTS Index on an InnoDB Table

For creating a full-text index, we recommend loading the table with data first, then creating the FULLTEXT index. This is much faster than creating the table with an empty FULLTEXT index, then inserting documents into it.

use test;

drop table if exists quotes;
-- The InnoDB full-text search feature in the 5.6 Labs release
-- lets us define a full-text index on an InnoDB table.

create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    
    , quote varchar(400)
    , source varchar(64)
  ) engine=innodb;

-- Insert some words and phrases to search for into the table.
insert into quotes (author, quote, source) values
  ('Abraham Lincoln', 'Fourscore and seven years ago...',
  'Gettysburg Address')
, ('George Harrison', 'All those years ago...', 'Live In Japan')
, ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.',
  '2010: The Year We Make Contact')
, ('Benjamin Franklin',
  'Early to bed and early to rise, makes a man healthy, wealthy, and wise.',
  'Poor Richard''s Almanack')
, ('James Thurber',
  'Early to rise and early to bed makes a male healthy and wealthy and dead.',
  'The New Yorker')
, ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.', 'Men in Black');

-- Create the fulltext index
create fulltext index idx on quotes(quote);

-- You can also create fulltext index at create table time
-- create table quotes
--  (    id int unsigned auto_increment primary key
--    , author varchar(64)    
--    , quote varchar(400)
--    , source varchar(64)
--    , fulltext (quote)
--  ) engine=innodb;

InnoDB can create the index in parallel through InnoDB's Fast Index Creation (FIC) feature, which supports parallel sorting. This technique actually allows you to tokenize, sort, and create an InnoDB FULLTEXT index in parallel.

To control the parallel sort degree, you can use a new configuration variable innodb_ft_sort_pll_degree (default 2, maximum 32). This variable specifies how many ways to parallelize the tokenization and sort operations. Experiments show, on a non-I/O bound system, the create index performance scales well with the number of CPUs and innodb_ft_sort_pll_degree.

The following table shows a quick run on 2.7 GB of Wikipedia data on an 8-core Linux x86 machine:

Server Time (min)
MyISAM 11 min 47.90
InnoDB (default) 7 min 25.21 sec
InnoDB pll_degree 5 min 34.98
InnoDB pll_degree = 8 4 min 9 sec
InnoDB pll_degree = 16 3 min 39.51 sec

Each InnoDB table with a FULLTEXT index includes a column FTS_DOC_ID. Specifying this column in the original table definition saves a rebuild of the entire table when an InnoDB FULLTEXT index is created. To do so, add the column FTS_DOC_ID (all uppercase) to the table with the FTS index. The column must be of BIGINT UNSIGNED NOT NULL datatype. It does not need to be an auto-increment column, but auto_increment could make the loading easier:

CREATE TABLE fts_test (
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
title VARCHAR(200),
body TEXT) ENGINE=InnoDB;

-- The unique "FTS_DOC_ID_INDEX" index on FTS_DOC_ID is also optional,
-- without it, InnoDB will automatically create it.:
CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);

Then load the data and:

CREATE FULLTEXT INDEX idx on fts_test (title, body);

Please note, currently this special column must be named FTS_DOC_ID, and the name must be all uppercase. If you leave out this column, InnoDB adds FTS_DOC_ID as a hidden column automatically when the FULLTEXT index is created — and that is an expensive operation for an InnoDB table, so it is much more efficient to define the column yourself.

Query, Natural Language Search

Once the data is loaded and committed, you can run queries using the MATCH(columns) AGAINST (search expression)operator to conduct actual searches. You can combine this operator with the WHERE clause and similar clauses in the SELECT statement.

-- Search for a single word.
select author as "Monolith" from quotes
  where match(quote) against ('monolith' in natural language mode);

Monolith
Arthur C. Clarke

1 row in set (0.01 sec)

-- The default minimum length is 3 rather than 4, and the search
-- returns words that appear in a high proportion of the table rows.
select author as "Ago" from quotes
  where match(quote) against ('ago' in natural language mode);
Ago
Abraham Lincoln
George Harrison
Arthur C. Clarke
K

4 rows in set (0.00 sec)

Query, Boolean Search

For more complicated searches, you can have multiple words and phrases, and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.

-- Search for a combination of words, not in the same order as the original.
select author as "Ago and Years" from quotes
  where match(quote) against ('+ago +years' in boolean mode);

Resulting in:

Ago and Years
Abraham Lincoln
George Harrison
Arthur C. Clarke
K

4 rows in set (0.00 sec)

Proximity Search

Proximity search is a new feature in InnoDB full-text search. It is a special case of Boolean search using the @ operator within the AGAINST() string. You supply two or more words, double-quoted, within the single-quoted AGAINST() string, followed by @distance to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not be equal to the number of characters) between the starting points of all these words.

-- The starting points for these words are too far apart
-- (not within 20 bytes), so no results.
select quote as "Too Far Apart" from quotes
  where match(quote) against ('"early wise" @20' in boolean mode);
Empty set (0.00 sec)

-- But the starting points of all words are within 100 bytes,
-- so this query does give results.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @100' in boolean mode);

This results in:

Early...Wise
Early to bed and early to rise, makes a man healthy, wealthy, and wise.

1 row in set (0.00 sec)

Transactions

One of the key ideas behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions. You can design tables with both full-text columns and other columns. Multiple sessions can update the full-text column data (and other columns in the table) simultaneously. The full-text data doesn't have to be treated as read-only or read-mostly.

One thing to note is that, just like the search feature in most transactional databases (such as Oracle Text for the Oracle Database), the tokenization for inserted strings is performed only at commit time, so a full-text search does not see the uncommitted data.

create table quotes_uncommitted
  (
      author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
    , primary key (author, quote(128))
  );

-- Insert but don't immediately commit.
insert into quotes_uncommitted select author, quote, source from quotes;

-- Within the same transaction, a full-text search does not see the uncommitted data.
select count(author), author as "Uncommitted Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Which gives us:

count(author) Uncommitted Results
0 NULL

1 row in set (0.00 sec)

-- OK, let's start with some committed data in the table, then empty the table,
-- then try some FTS queries, both before and after the commit.
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
delete from quotes_uncommitted;
select count(author), author as "Deleted but still not committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Resulting in:

count(author) Deleted but still not committed
0 NULL

1 row in set (0.00 sec)

rollback;
select count(author), author as "Deleted and rolled back" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

Which results in:

count(author) Deleted and rolled back
4 Abraham Lincoln

1 row in set (0.00 sec)

delete from quotes_uncommitted;
commit;
select count(author), author as "Deleted and committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
count(author) Deleted and committed
0 NULL

1 row in set (0.00 sec)

insert into quotes_uncommitted select author, quote, source from quotes;
commit;
truncate table quotes_uncommitted;
select count(author), author as "Truncated" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);

With the result of:

count(author) Truncated
0 NULL

1 row in set (0.00 sec)


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video