Oracle recently provided access to many new MySQL 5.6 features through http://labs.mysql.com for the user community to test and comment on. One notable feature is the InnoDB Full-Text Search (FTS) engine. It lets users build
FULLTEXT indexes on InnoDB tables to represent text-based content and speed up searches for words and phrases. Combining full-text search with InnoDB tables opens up text capability to transactional applications, where the textual data is frequently inserted, updated, and deleted. Given the importance of this feature to the MySQL audience, this article explains the design of InnoDB FTS and provides recipes for its use, as well as a short list of the limitations of this release.
The Design of Full Text Search
Like most database-powered search features, InnoDB Full-Text Search is designed as an "inverted index," in which the incoming text strings are tokenized into individual words, and these words are stored in one or more auxiliary tables. For each word, a list of Document IDs and word position pairs is stored. We call such a Document ID/Position pair list an "ilist." Thus, in the index table, two important columns are
ilist. InnoDB FTS supports Proximity Search, which has been lacking in the MyISAM FTS feature, by storing the position (as a byte offset) for each word.
InnoDB FTS represents each search index as six tables. The words are divided among these tables based on their first characters. Currently, the partition criteria are hard coded, targeted to the Latin character set.
The partitioned design helps to parallelize the operation. In the current release, InnoDB only parallelizes the
CREATE INDEX operation, but it can parallelize other operations such as query and full-text search later.
Let's move to an example to illustrate the auxiliary tables with an InnoDB FTS index.
Starting in MySQL 5.6, you can examine InnoDB metadata (the InnoDB system tables) through corresponding tables in the
INFORMATION_SCHEMA database. Let's see how these auxiliary tables are created alongside the FTS index. These tables are specific to InnoDB, so you will not find them in MySQL's metadata, and you cannot query them through normal SQL interfaces.
Here is the definition of the
quotes table used for our example:
create table quotes ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(400) , source varchar(64) ) engine=innodb; -- Create the fulltext index create fulltext index idx on quotes(quote); -- Now check tables in the InnoDB SYS_TABLES: mysql> SELECT table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES;
19 rows in set (0.01 sec)
This output shows that there are multiple auxiliary tables with names prefixed with
FTS_ and postfixed with
INDEX_*. They are all inverted index tables associated with table
test/quotes through its
table_id column. The table
test/quotes has a
table_id value of 547, whose hex representation is 0x223, and "223" is embedded in the names of all the auxiliary tables associated with the
quotes table. Another hex value in the auxiliary table's name is the
FULLTEXT index ID. For example, in the aforementioned example, for the table
test/FTS_0000000000000223_0000000000000257_INDEX_3, the hex value 0x223 (547) means it is an auxiliary index table for table
quotes, and its index ID is 0x257 (599):
The index information can be retrieved through
SELECT index_id, table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE table_id=547;
3 rows in set (0.01 sec)
idx is the FTS index we just created. There is also one index
FTS_DOC_ID_INDEX on the table. This is the index on the
FTS_DOC_ID column (which we will discuss later), and it is created along with the
FULLTEXT index if it does not exist already.
Batching the Insert Value with the InnoDB FTS Index Cache
FULLTEXT index is represented on disk as a set of auxiliary tables, each consisting of columns
ilist. InnoDB has an additional in-memory structure, the FTS Index Cache, which batches the inserted values before they are synchronized to the auxiliary index tables on disk. It acts like the InnoDB change buffer to save I/O during intensive DML operations.
The FTS Index Cache represents the same
word values as the
FULLTEXT index, using a red-black tree data structure. By caching the words from the inserted documents, we can avoid reading from disk while doing the initial filtering phase of a search, then merge (via union) the in-memory word list with the document IDs and positional data from the FTS index tables when producing the result set.
The buffering by the FTS Index Cache avoids frequent updates to the underlying FTS index tables during busy inserts and updates; the
FULLTEXT index only needs to be synchronized when the index cache is full. The batching technique also minimizes the number of entries for each word in the
FULLTEXT index. Instead of flushing each word with a single
ilist entry (with a single Document ID), it combines results from many inserted documents, creating an
ilist with multiple Document ID/Position pairs, and storing this information to disk as a single entry. This technique reduces redundant data, making the
FULLTEXT index smaller.
The index cache size is configurable by the option
innodb_ft_cache_size, with a default value of 32 MB. This option will be discussed in greater detail with the specifics of DML operations later.
InnoDB FTS Document ID
Another important concept is Document ID management. As with most FTS engines, the mapping from word to record goes through a unique ID. In our case, it is represented by the
FTS_DOC_ID (uppercase required) column. If this column is not defined, InnoDB automatically adds it to the user table when creating the
FULLTEXT index. The column itself must be of
BIGINT UNSIGNED NOT NULL type, with a unique index named
FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the
FULLTEXT index after loading data, since InnoDB does not need to reorganize the table (as happens when a new column is added). In such a case, you must manage the Document ID column to avoid empty or duplicate values.
Using the InnoDB FTS
New features often involve new syntax or changes to the existing syntax. However, this InnoDB feature is compatible with the existing MySQL syntax for full-text search using MyISAM tables. If you are familiar with the existing MyISAM FTS feature, you can quickly start using the equivalent InnoDB full-text search. However, there are some recipes and tips. (After these, we discuss a few limitations of this release.)