Create an FTS Index on an InnoDB Table
For creating a full-text index, we recommend loading the table with data first, then creating the FULLTEXT
index. This is much faster than creating the table with an empty FULLTEXT
index, then inserting documents into it.
use test; drop table if exists quotes; -- The InnoDB full-text search feature in the 5.6 Labs release -- lets us define a full-text index on an InnoDB table. create table quotes ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(400) , source varchar(64) ) engine=innodb; -- Insert some words and phrases to search for into the table. insert into quotes (author, quote, source) values ('Abraham Lincoln', 'Fourscore and seven years ago...', 'Gettysburg Address') , ('George Harrison', 'All those years ago...', 'Live In Japan') , ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.', '2010: The Year We Make Contact') , ('Benjamin Franklin', 'Early to bed and early to rise, makes a man healthy, wealthy, and wise.', 'Poor Richard''s Almanack') , ('James Thurber', 'Early to rise and early to bed makes a male healthy and wealthy and dead.', 'The New Yorker') , ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.', 'Men in Black'); -- Create the fulltext index create fulltext index idx on quotes(quote); -- You can also create fulltext index at create table time -- create table quotes -- ( id int unsigned auto_increment primary key -- , author varchar(64) -- , quote varchar(400) -- , source varchar(64) -- , fulltext (quote) -- ) engine=innodb;
InnoDB can create the index in parallel through InnoDB's Fast Index Creation (FIC) feature, which supports parallel sorting. This technique actually allows you to tokenize, sort, and create an InnoDB FULLTEXT
index in parallel.
To control the parallel sort degree, you can use a new configuration variable innodb_ft_sort_pll_degree
(default 2, maximum 32). This variable specifies how many ways to parallelize the tokenization and sort operations. Experiments show, on a non-I/O bound system, the create index performance scales well with the number of CPUs and innodb_ft_sort_pll_degree
.
The following table shows a quick run on 2.7 GB of Wikipedia data on an 8-core Linux x86 machine:
Server | Time (min) |
MyISAM | 11 min 47.90 |
InnoDB (default) | 7 min 25.21 sec |
InnoDB pll_degree | 5 min 34.98 |
InnoDB pll_degree = 8 | 4 min 9 sec |
InnoDB pll_degree = 16 | 3 min 39.51 sec |
Each InnoDB table with a FULLTEXT
index includes a column FTS_DOC_ID
. Specifying this column in the original table definition saves a rebuild of the entire table when an InnoDB FULLTEXT
index is created. To do so, add the column FTS_DOC_ID
(all uppercase) to the table with the FTS index. The column must be of BIGINT UNSIGNED NOT NULL
datatype. It does not need to be an auto-increment column, but auto_increment
could make the loading easier:
CREATE TABLE fts_test ( FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL, title VARCHAR(200), body TEXT) ENGINE=InnoDB; -- The unique "FTS_DOC_ID_INDEX" index on FTS_DOC_ID is also optional, -- without it, InnoDB will automatically create it.: CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);
Then load the data and:
CREATE FULLTEXT INDEX idx on fts_test (title, body);
Please note, currently this special column must be named FTS_DOC_ID
, and the name must be all uppercase. If you leave out this column, InnoDB adds FTS_DOC_ID
as a hidden column automatically when the FULLTEXT
index is created and that is an expensive operation for an InnoDB table, so it is much more efficient to define the column yourself.
Query, Natural Language Search
Once the data is loaded and committed, you can run queries using the MATCH
(columns) AGAINST
(search expression)operator to conduct actual searches. You can combine this operator with the WHERE
clause and similar clauses in the SELECT
statement.
-- Search for a single word. select author as "Monolith" from quotes where match(quote) against ('monolith' in natural language mode);
Monolith |
Arthur C. Clarke |
1 row in set (0.01 sec)
-- The default minimum length is 3 rather than 4, and the search -- returns words that appear in a high proportion of the table rows. select author as "Ago" from quotes where match(quote) against ('ago' in natural language mode);
Ago |
Abraham Lincoln George Harrison Arthur C. Clarke K |
4 rows in set (0.00 sec)
Query, Boolean Search
For more complicated searches, you can have multiple words and phrases, and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.
-- Search for a combination of words, not in the same order as the original. select author as "Ago and Years" from quotes where match(quote) against ('+ago +years' in boolean mode);
Resulting in:
Ago and Years |
Abraham Lincoln George Harrison Arthur C. Clarke K |
4 rows in set (0.00 sec)
Proximity Search
Proximity search is a new feature in InnoDB full-text search. It is a special case of Boolean search using the @
operator within the AGAINST()
string. You supply two or more words, double-quoted, within the single-quoted AGAINST()
string, followed by @distance
to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not be equal to the number of characters) between the starting points of all these words.
-- The starting points for these words are too far apart -- (not within 20 bytes), so no results. select quote as "Too Far Apart" from quotes where match(quote) against ('"early wise" @20' in boolean mode); Empty set (0.00 sec) -- But the starting points of all words are within 100 bytes, -- so this query does give results. select quote as "Early...Wise" from quotes where match(quote) against ('"early wise" @100' in boolean mode);
This results in:
Early...Wise |
Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
1 row in set (0.00 sec)
Transactions
One of the key ideas behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions. You can design tables with both full-text columns and other columns. Multiple sessions can update the full-text column data (and other columns in the table) simultaneously. The full-text data doesn't have to be treated as read-only or read-mostly.
One thing to note is that, just like the search feature in most transactional databases (such as Oracle Text for the Oracle Database), the tokenization for inserted strings is performed only at commit time, so a full-text search does not see the uncommitted data.
create table quotes_uncommitted ( author varchar(64) , quote varchar(4000) , source varchar(64) , fulltext(quote) , primary key (author, quote(128)) ); -- Insert but don't immediately commit. insert into quotes_uncommitted select author, quote, source from quotes; -- Within the same transaction, a full-text search does not see the uncommitted data. select count(author), author as "Uncommitted Results" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
Which gives us:
count(author) | Uncommitted Results |
0 | NULL |
1 row in set (0.00 sec)
-- OK, let's start with some committed data in the table, then empty the table, -- then try some FTS queries, both before and after the commit. insert into quotes_uncommitted select author, quote, source from quotes; commit; delete from quotes_uncommitted; select count(author), author as "Deleted but still not committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
Resulting in:
count(author) | Deleted but still not committed |
0 | NULL |
1 row in set (0.00 sec)
rollback; select count(author), author as "Deleted and rolled back" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
Which results in:
count(author) | Deleted and rolled back |
4 | Abraham Lincoln |
1 row in set (0.00 sec)
delete from quotes_uncommitted; commit; select count(author), author as "Deleted and committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
count(author) | Deleted and committed |
0 | NULL |
1 row in set (0.00 sec)
insert into quotes_uncommitted select author, quote, source from quotes; commit; truncate table quotes_uncommitted; select count(author), author as "Truncated" from quotes_uncommitted where match(quote) against ('ago' in natural language mode);
With the result of:
count(author) | Truncated |
0 | NULL |
1 row in set (0.00 sec)