[[icu-tokenizer]]
=== icu_tokenizer

The `icu_tokenizer` uses the same Unicode Text Segmentation algorithm as the
`standard` tokenizer,((("words", "identifying", "using icu_tokenizer")))((("Unicode Text Segmentation algorithm")))((("icu_tokenizer"))) but adds better support for some Asian languages by
using a dictionary-based approach to identify words in Thai, Lao, Chinese,
Japanese, and Korean, and using custom rules to break Myanmar and Khmer text
into syllables.

For instance, compare the tokens ((("standard tokenizer", "icu_tokenizer versus")))produced by the `standard` tokenizer and
the `icu_tokenizer` when tokenizing ``Hello. I am from Bangkok.'' in
Thai (`สวัสดี ผมมาจากกรุงเทพฯ`).

The `standard` tokenizer produces two tokens, one for each sentence: `สวัสดี` and
`ผมมาจากกรุงเทพฯ`. That is useful only if you want to search for the whole
sentence ``I am from Bangkok.'', but not if you want to search for just
``Bangkok.''

The `icu_tokenizer`, on the other hand, is able to break up the text into the
individual words (`สวัสดี`, `ผม`, `มา`, `จาก`, `กรุงเทพฯ`), making them
easier to search.

In contrast, the `standard` tokenizer ``over-tokenizes'' Chinese and Japanese
text, often breaking up whole words into single characters. Because there
are no spaces between words, it can be difficult to tell whether consecutive
characters are separate words or form a single word. For instance:

* 向 means _facing_, 日 means _sun_, and 葵 means _hollyhock_. When written
  together, 向日葵 means _sunflower_.

* 五 means _five_ or _fifth_, 月 means _month_, and 雨 means _rain_.
  The first two characters written together as 五月 mean _the month of May_,
  and adding the third character, 五月雨 means _continuous rain_. When
  combined with a fourth character, 式, meaning _style_, the word 五月雨式
  becomes an adjective for anything consecutive or unrelenting.

Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts:
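
[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=standard
向日葵

GET /_analyze?tokenizer=icu_tokenizer
向日葵
--------------------------------------------------
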
The `standard` tokenizer in the preceding example would emit each character
as a separate token: `向`, `日`, `葵`. The `icu_tokenizer` would
emit the single token `向日葵` (sunflower).

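
To make the idea of dictionary-based word identification more concrete, here is
a minimal Python sketch of greedy longest-match segmentation. It is only an
illustration: the function name `longest_match_segment` and the tiny
`TOY_DICTIONARY` are hypothetical, and ICU's real dictionaries and break
algorithms are considerably more sophisticated than this.

```python
# Toy illustration only: NOT how ICU is implemented.
# THAI_WORDS and TOY_DICTIONARY are hypothetical sample data.
THAI_WORDS = ["ผม", "มา", "จาก", "กรุงเทพฯ"]
THAI_SENTENCE = "".join(THAI_WORDS)  # "I am from Bangkok", written without spaces
TOY_DICTIONARY = {"向日葵", "五月", "五月雨", "五月雨式", *THAI_WORDS}


def longest_match_segment(text, dictionary):
    """Split text by always taking the longest dictionary word at each position."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit a single character
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# With no dictionary, every character becomes its own token.
print(longest_match_segment("向日葵", set()))             # ['向', '日', '葵']
# With a dictionary, the larger concept survives as one token.
print(longest_match_segment("向日葵", TOY_DICTIONARY))    # ['向日葵']
print(longest_match_segment("五月雨式", TOY_DICTIONARY))  # ['五月雨式']
print(longest_match_segment(THAI_SENTENCE, TOY_DICTIONARY))
```

The last call splits the Thai sentence into the four words listed in
`THAI_WORDS`. The empty-dictionary call mimics the ``over-tokenizing''
behavior described above, while the dictionary lookups keep 向日葵 and 五月雨式
intact.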
Another difference between the `standard` tokenizer and the `icu_tokenizer` is
that the latter will break a word containing characters written in different
scripts (for example, `βeta`) into separate tokens (`β` and `eta`), while the
former will emit the word as a single token: `βeta`.
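
The script-boundary behavior can be approximated in a few lines of Python.
This is a rough sketch under a loud assumption: it uses the first word of each
character's Unicode name (such as `GREEK` or `LATIN`) as a stand-in for the
real Unicode Script property, which is what a proper implementation would
consult. The helper names `rough_script` and `split_on_script_change` are
hypothetical.

```python
import unicodedata
from itertools import groupby


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is only a crude proxy for the Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_on_script_change(word):
    """Break a word wherever the (approximate) script changes."""
    return ["".join(group) for _, group in groupby(word, key=rough_script)]


print(split_on_script_change("\u03b2eta"))  # ['β', 'eta']
print(split_on_script_change("beta"))       # ['beta']
```

A word written entirely in one script stays whole, while the mixed Greek and
Latin word is split at the point where the script changes.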