DML (Insert, Update, and Delete)
As mentioned in the last section, the inserting string processing (tokenization) is performed only at commit time. This behavior is consistent with FTS in most transactional database systems, since the DMLs on tables with full-text search indexes are costly operations. Typically, you synchronize the results of DML operations to the FTS index periodically rather than in real time.
A typical example is the Oracle Text component of Oracle Database, which synchronizes table data with the search index periodically or manually. Starting in Oracle Database 11g, it now allows users to specify when data updates are reflected in the full-text index manually, on commit, or at a regular interval.
For InnoDB Full-Text Search, the FTS index is updated at transaction commit time, when all the document tokenization and inserting into the FTS index (or associated cache) happens. All other DML operations still follow their usual rules about commit/rollback and isolation levels.
- Insert. The inserted document is tokenized at commit time and inserted into the FTS Index Cache. This cache has a configurable size (
innodb_ft_cache_size) with a default of 32 MB. Once this cache is full, it is synchronized to the on-disk tables that represent the index. During normal server shutdown, any content in the FTS Index Cache is synchronized to the
FULLTEXTindex on disk. If there is a server crash and the content of FTS Index Cache was not saved to disk, after the server reboot, the data is made consistent when you first query or insert into the
FULLTEXTindex. Any documents missing from the
FULLTEXTindex are read from the original table, retokenized, and added to the FTS Index Cache.
- Delete. When you delete rows from a table containing an InnoDB
FULLTEXTindex, InnoDB does not delete the corresponding word entries in the FTS auxiliary tables. At commit time, InnoDB records the deleted Document IDs in another auxiliary table named
DELETED. For each query, InnoDB consults this
DELETEDtable to filter out any deleted documents. This design makes the delete operation simple and fast, without the need to update thousands of word entries for each deleted document. Since the word entries are not removed from the
FULLTEXTindex automatically, you need to perform index optimization periodically (as described in the next section) to keep the index size reasonable.
- Update. For updates affecting any FTS-indexed columns, the updates are performed as a combination of Delete + Insert operations. In-place update happens only if the update affects none of the columns referenced by the
As just discussed, if there is substantial update and delete activity on an InnoDB
FULLTEXT index, the index could become bloated, as InnoDB logs the deleted Document IDs in the
DELETED auxiliary table. Over time, the
FULLTEXT index could grow bigger despite rows being removed from the original table. To resolve this, you can optimize the index. This operation does two things: it removes the deleted Document ID from the word's Document ID/Position pair list (
ilist), and it consolidates multiple entries for the same word to one entry if possible by consolidating their Document ID/Position pair list (
Currently, MySQL runs the optimization operation only as part of the
OPTIMIZE TABLE command, when the
innodb_optimize_fulltext_only configuration variable is enabled:
mysql> set global innodb_optimize_fulltext_only=1; mysql> optimize table articles;
As the table could be large, and optimization could take a significant amount of time, you typically should do the optimization in stages. The configuration variable
innodb_ft_num_word_optimize specifies how many words to optimize for each
OPTIMIZE TABLE command. It defaults to 2000 words. When the next
OPTIMIZE TABLE command is issued, the server continues the process from where it left off.
Stopwords are common or trivial words, such as "the" and "an," which are omitted from the
FULLTEXT index. They are potentially different for each language or application. InnoDB FTS provides two sources of stopwords:
- MySQL server predefined stopwords. If no user stopword list is defined, this default list is used. You can view this list by querying the table
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
- User-defined stopwords.You can define your own stopwords by creating a table with a single column named
value, with datatype
varchar, and pointing the global variable
innodb_ft_server_stopword_tableto this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. For example:
# Define a correctly formatted user stopword table create table user_stopword(value varchar(30)) engine = innodb; # Point to this stopword table with "db name/table name" set global innodb_ft_server_stopword_table = "test/user_stopword";
Earlier in this article, we demonstrated ways to make creating an InnoDB
FULLTEXT index faster. Preliminary results on a 2.7 GB Wikipedia data set are shown in the following table. The queries used include various combinations of include, exclude, wildcard, and proximity operators, even an intentional misspelling ("preassumtions"). They are identified by the number in the leftmost column:
1) select count(*) from wp where match(title,text) against ('Creationist +Abrahamic -creationism' in boolean mode); 2) select count(*) from wp where match(title,text) against ('Abrahamic'); 3) select count(*) from wp where match(title,text) against ('preassumtions +orangutan' in boolean mode); 4) select count(*) from wp where match(title,text) against ('orangutan +falsified ~naturalistically' in boolean mode); 5) select count(*) from wp where match(title,text) against ('+india* -leader +gandh*' in boolean mode); 6) select count(*) from wp where match(title,text) against ('"american culture"@09′ in boolean mode);
|Query||InnoDB (ms)||* InnoDB FTS Search Processing Time||MyISAM (ms)|
|6 (New InnoDB Feature)||5250||N/A|
* FTS Search processing time is the actual time spent by InnoDB full-text index scan.
As this version of FTS is fairly new, there are some aspects that have limitations that Oracle will remove in succedding versions. These include:
- Ranking: Currently, the FTS uses a very simple ranking mechanism (term frequency, inverse document frequency) for document result ranking. The more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked.
- Stemming, which matches alternative forms such as singular, plural, past tense, and other forms derived from the same root word is not supported.
- Ideographic languages, Chinese, Japanese and Korean (referred to as CJK), which do not have word delimiters, do not yet have support for N-GRAM parsing. (MyISAM FTS has a similar limitation.)
- Single character set: Although the use of multiple character sets within a single table is supported, all columns in a
FULLTEXTindex must use the same character set and collation. (MyISAM FTS has a similar limitation.)
However, the InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents in transaction-intensive database applications. Subsequent releases will continue to increase its performance, usability, and feature set.
Jimmy Yang and John Russell work at Oracle on the InnoDB product.