Channels ▼
RSS

Testing

Full-Text Search with InnoDB


Oracle recently provided access to many new MySQL 5.6 features through http://labs.mysql.com for the user community to test and comment on. One notable feature is the InnoDB Full-Text Search (FTS) engine. It lets users build FULLTEXT indexes on InnoDB tables to represent text-based content and speed up searches for words and phrases. Combining full-text search with InnoDB tables opens up text capability to transactional applications, where the textual data is frequently inserted, updated, and deleted. Given the importance of this feature to the MySQL audience, this article explains the design of InnoDB FTS and provides recipes for its use, as well as a short list of the limitations of this release.

The Design of Full Text Search

Like most database-powered search features, InnoDB Full-Text Search is designed as an "inverted index," in which the incoming text strings are tokenized into individual words, and these words are stored in one or more auxiliary tables. For each word, a list of Document IDs and word position pairs is stored. We call such a Document ID/Position pair list an "ilist." Thus, in the index table, two important columns are word and ilist. InnoDB FTS supports Proximity Search, which has been lacking in the MyISAM FTS feature, by storing the position (as a byte offset) for each word.

InnoDB FTS represents each search index as six tables. The words are divided among these tables based on their first characters. Currently, the partition criteria are hard coded, targeted to the Latin character set.

The partitioned design helps to parallelize the operation. In the current release, InnoDB only parallelizes the CREATE INDEX operation, but it can parallelize other operations such as query and full-text search later.

Let's move to an example to illustrate the auxiliary tables with an InnoDB FTS index.

Starting in MySQL 5.6, you can examine InnoDB metadata (the InnoDB system tables) through corresponding tables in the INFORMATION_SCHEMA database. Let's see how these auxiliary tables are created alongside the FTS index. These tables are specific to InnoDB, so you will not find them in MySQL's metadata, and you cannot query them through normal SQL interfaces.

Here is the definition of the quotes table used for our example:

  create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    
    , quote varchar(400)
    , source varchar(64)
  ) engine=innodb;

-- Create the fulltext index
create fulltext index idx on quotes(quote);

-- Now check tables in the InnoDB SYS_TABLES:

mysql> SELECT table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES;

This generates:

table_id name
... ...
548 test/FTS_0000000000000223_0000000000000257_INDEX_1
549 test/FTS_0000000000000223_0000000000000257_INDEX_2
550 test/FTS_0000000000000223_0000000000000257_INDEX_3
547 test/quotes
... ...

19 rows in set (0.01 sec)

This output shows that there are multiple auxiliary tables with names prefixed with FTS_ and postfixed with INDEX_*. They are all inverted index tables associated with table test/quotes through its table_id column. The table test/quotes has a table_id value of 547, whose hex representation is 0x223, and "223" is embedded in the names of all the auxiliary tables associated with the quotes table. Another hex value in the auxiliary table's name is the FULLTEXT index ID. For example, in the aforementioned example, for the table test/FTS_0000000000000223_0000000000000257_INDEX_3, the hex value 0x223 (547) means it is an auxiliary index table for table quotes, and its index ID is 0x257 (599):

The index information can be retrieved through INFORMATION_SCHEMA.INNODB_SYS_INDEXES, using SELECT index_id, table_id, name FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE table_id=547;

generating:

index_id table_id name
598 547 PRIMARY
599 547 idx
600 547 FTS_DOC_ID_INDEX

3 rows in set (0.01 sec)

Index idx is the FTS index we just created. There is also one index FTS_DOC_ID_INDEX on the table. This is the index on the FTS_DOC_ID column (which we will discuss later), and it is created along with the FULLTEXT index if it does not exist already.

Batching the Insert Value with the InnoDB FTS Index Cache

The InnoDB FULLTEXT index is represented on disk as a set of auxiliary tables, each consisting of columns word and ilist. InnoDB has an additional in-memory structure, the FTS Index Cache, which batches the inserted values before they are synchronized to the auxiliary index tables on disk. It acts like the InnoDB change buffer to save I/O during intensive DML operations.

The FTS Index Cache represents the same word values as the FULLTEXT index, using a red-black tree data structure. By caching the words from the inserted documents, we can avoid reading from disk while doing the initial filtering phase of a search, then merge (via union) the in-memory word list with the document IDs and positional data from the FTS index tables when producing the result set.

The buffering by the FTS Index Cache avoids frequent updates to the underlying FTS index tables during busy inserts and updates; the FULLTEXT index only needs to be synchronized when the index cache is full. The batching technique also minimizes the number of entries for each word in the FULLTEXT index. Instead of flushing each word with a single ilist entry (with a single Document ID), it combines results from many inserted documents, creating an ilist with multiple Document ID/Position pairs, and storing this information to disk as a single entry. This technique reduces redundant data, making the FULLTEXT index smaller.

The index cache size is configurable by the option innodb_ft_cache_size, with a default value of 32 MB. This option will be discussed in greater detail with the specifics of DML operations later.

InnoDB FTS Document ID

Another important concept is Document ID management. As with most FTS engines, the mapping from word to record goes through a unique ID. In our case, it is represented by the FTS_DOC_ID (uppercase required) column. If this column is not defined, InnoDB automatically adds it to the user table when creating the FULLTEXT index. The column itself must be of BIGINT UNSIGNED NOT NULL type, with a unique index named FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the FULLTEXT index after loading data, since InnoDB does not need to reorganize the table (as happens when a new column is added). In such a case, you must manage the Document ID column to avoid empty or duplicate values.

Using the InnoDB FTS

New features often involve new syntax or changes to the existing syntax. However, this InnoDB feature is compatible with the existing MySQL syntax for full-text search using MyISAM tables. If you are familiar with the existing MyISAM FTS feature, you can quickly start using the equivalent InnoDB full-text search. However, there are some recipes and tips. (After these, we discuss a few limitations of this release.)


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video