Built-in analyzer reference
Elasticsearch ships with a wide range of built-in analyzers, which can be used in any index without further configuration:
- Standard Analyzer
  The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
- Simple Analyzer
  The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
- Whitespace Analyzer
  The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
- Stop Analyzer
  The stop analyzer is like the simple analyzer, but also supports removal of stop words.
- Keyword Analyzer
  The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
- Pattern Analyzer
  The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.
- Language Analyzers
  Elasticsearch provides many language-specific analyzers like english or french.
- Fingerprint Analyzer
  The fingerprint analyzer is a specialist analyzer which creates a fingerprint that can be used for duplicate detection.
Custom analyzers
If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.
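For illustration, such a custom analyzer could be declared roughly as follows; the index name my_custom_example, the analyzer name my_custom_analyzer, and the particular html_strip / standard / lowercase / asciifolding choices are only an example, not a recommendation:
PUT /my_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}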
Fingerprint analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token. If a stop word list is configured, stop words will also be removed.
Example output
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
The fingerprint analyzer accepts the following parameters:
- separator: The character to use to concatenate the terms. Defaults to a space.
- max_output_size: The maximum token size to emit. Defaults to 255.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]
Definition
The fingerprint analyzer consists of:
- Tokenizer: Standard Tokenizer
- Token Filters: Lower Case Token Filter, ASCII Folding Token Filter, Stop Token Filter (disabled by default), Fingerprint Token Filter
If you need to customize the fingerprint analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in fingerprint analyzer and you can use it as a starting point for further customization:
PUT /fingerprint_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_fingerprint": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"fingerprint"
]
}
}
}
}
}
Keyword analyzer
The keyword analyzer is a "noop" analyzer which returns the entire input string as a single token.
Example output
POST _analyze
{
"analyzer": "keyword",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following single term:
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
Configuration
The keyword analyzer is not configurable.
Definition
The keyword analyzer consists of:
- Tokenizer: Keyword Tokenizer
If you need to customize the keyword analyzer then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. Usually, you should prefer the Keyword type when you want strings that are not split into tokens, but just in case you need it, this would recreate the built-in keyword analyzer and you can use it as a starting point for further customization:
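A minimal sketch of such a rebuilt keyword analyzer (the index name keyword_example and the intentionally empty filter list are illustrative):
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
Any token filters you want to add would go into the currently empty filter list.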
Language analyzers
A set of analyzers aimed at analyzing specific language text. The following types are supported:
arabic,
armenian,
basque,
bengali,
brazilian,
bulgarian,
catalan,
cjk,
czech,
danish,
dutch,
english,
estonian,
finnish,
french,
galician,
german,
greek,
hindi,
hungarian,
indonesian,
irish,
italian,
latvian,
lithuanian,
norwegian,
persian,
portuguese,
romanian,
russian,
serbian,
sorani,
spanish,
swedish,
turkish,
thai.
Configuring language analyzers
Excluding words from stemming
The stem_exclusion parameter allows you to specify an array of words that should not be stemmed. Internally, this functionality is implemented by adding the keyword_marker token filter with its keywords set to the value of the stem_exclusion parameter (see the configuration sketch after the list below).
The following analyzers support setting a custom stem_exclusion list:
arabic, armenian, basque, bengali, bulgarian, catalan, czech,
dutch, english, finnish, french, galician,
german, hindi, hungarian, indonesian, irish, italian, latvian,
lithuanian, norwegian, portuguese, romanian, russian, serbian,
sorani, spanish, swedish, turkish.
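A sketch of such a configuration, using the english analyzer (the index name, analyzer name and word list below are purely illustrative); the listed words are kept from being stemmed:
PUT /english_stem_exclusion_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ]
        }
      }
    }
  }
}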
Reimplementing language analyzers
The built-in language analyzers can be reimplemented as custom analyzers (as described below) in order to customize their behaviour.
If you do not intend to exclude words from being stemmed (the equivalent of the stem_exclusion parameter above), then you should remove the keyword_marker token filter from the custom analyzer configuration.
arabic analyzer
The arabic analyzer could be reimplemented as a custom analyzer as follows:
PUT /arabic_example
{
"settings": {
"analysis": {
"filter": {
"arabic_stop": {
"type": "stop",
"stopwords": "_arabic_"
},
"arabic_keywords": {
"type": "keyword_marker",
"keywords": ["مثال"]
},
"arabic_stemmer": {
"type": "stemmer",
"language": "arabic"
}
},
"analyzer": {
"rebuilt_arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
}
}
}
armenian analyzer
The armenian analyzer could be reimplemented as a custom analyzer as follows:
PUT /armenian_example
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": ["օրինակ"]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer"
]
}
}
}
}
}
basque analyzer
The basque analyzer could be reimplemented as a custom analyzer as follows:
PUT /basque_example
{
"settings": {
"analysis": {
"filter": {
"basque_stop": {
"type": "stop",
"stopwords": "_basque_"
},
"basque_keywords": {
"type": "keyword_marker",
"keywords": ["Adibidez"]
},
"basque_stemmer": {
"type": "stemmer",
"language": "basque"
}
},
"analyzer": {
"rebuilt_basque": {
"tokenizer": "standard",
"filter": [
"lowercase",
"basque_stop",
"basque_keywords",
"basque_stemmer"
]
}
}
}
}
}
bengali analyzer
The bengali analyzer could be reimplemented as a custom analyzer as follows:
PUT /bengali_example
{
"settings": {
"analysis": {
"filter": {
"bengali_stop": {
"type": "stop",
"stopwords": "_bengali_"
},
"bengali_keywords": {
"type": "keyword_marker",
"keywords": ["উদাহরণ"]
},
"bengali_stemmer": {
"type": "stemmer",
"language": "bengali"
}
},
"analyzer": {
"rebuilt_bengali": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"bengali_keywords",
"indic_normalization",
"bengali_normalization",
"bengali_stop",
"bengali_stemmer"
]
}
}
}
}
}
brazilian analyzer
The brazilian analyzer could be reimplemented as a custom analyzer as follows:
PUT /brazilian_example
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_keywords": {
"type": "keyword_marker",
"keywords": ["exemplo"]
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"rebuilt_brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_keywords",
"brazilian_stemmer"
]
}
}
}
}
}
bulgarian analyzer
The bulgarian analyzer could be reimplemented as a custom analyzer as follows:
PUT /bulgarian_example
{
"settings": {
"analysis": {
"filter": {
"bulgarian_stop": {
"type": "stop",
"stopwords": "_bulgarian_"
},
"bulgarian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"bulgarian_stemmer": {
"type": "stemmer",
"language": "bulgarian"
}
},
"analyzer": {
"rebuilt_bulgarian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"bulgarian_stop",
"bulgarian_keywords",
"bulgarian_stemmer"
]
}
}
}
}
}
catalan analyzer
The catalan analyzer could be reimplemented as a custom analyzer as follows:
PUT /catalan_example
{
"settings": {
"analysis": {
"filter": {
"catalan_elision": {
"type": "elision",
"articles": [ "d", "l", "m", "n", "s", "t"],
"articles_case": true
},
"catalan_stop": {
"type": "stop",
"stopwords": "_catalan_"
},
"catalan_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"catalan_stemmer": {
"type": "stemmer",
"language": "catalan"
}
},
"analyzer": {
"rebuilt_catalan": {
"tokenizer": "standard",
"filter": [
"catalan_elision",
"lowercase",
"catalan_stop",
"catalan_keywords",
"catalan_stemmer"
]
}
}
}
}
}
cjk analyzer
You may find that icu_analyzer in the ICU analysis plugin works better for CJK text than the cjk analyzer. Experiment with your text and queries.
The cjk analyzer could be reimplemented as a custom analyzer as follows:
PUT /cjk_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": [
"a", "and", "are", "as", "at", "be", "but", "by", "for",
"if", "in", "into", "is", "it", "no", "not", "of", "on",
"or", "s", "such", "t", "that", "the", "their", "then",
"there", "these", "they", "this", "to", "was", "will",
"with", "www"
]
}
},
"analyzer": {
"rebuilt_cjk": {
"tokenizer": "standard",
"filter": [
"cjk_width",
"lowercase",
"cjk_bigram",
"english_stop"
]
}
}
}
}
}
czech analyzer
The czech analyzer could be reimplemented as a custom analyzer as follows:
PUT /czech_example
{
"settings": {
"analysis": {
"filter": {
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
},
"czech_keywords": {
"type": "keyword_marker",
"keywords": ["příklad"]
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
}
},
"analyzer": {
"rebuilt_czech": {
"tokenizer": "standard",
"filter": [
"lowercase",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
}
}
}
}
}
danish analyzer
The danish analyzer could be reimplemented as a custom analyzer as follows:
PUT /danish_example
{
"settings": {
"analysis": {
"filter": {
"danish_stop": {
"type": "stop",
"stopwords": "_danish_"
},
"danish_keywords": {
"type": "keyword_marker",
"keywords": ["eksempel"]
},
"danish_stemmer": {
"type": "stemmer",
"language": "danish"
}
},
"analyzer": {
"rebuilt_danish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"danish_stop",
"danish_keywords",
"danish_stemmer"
]
}
}
}
}
}
dutch analyzer
The dutch analyzer could be reimplemented as a custom analyzer as follows:
PUT /dutch_example
{
"settings": {
"analysis": {
"filter": {
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"dutch_keywords": {
"type": "keyword_marker",
"keywords": ["voorbeeld"]
},
"dutch_stemmer": {
"type": "stemmer",
"language": "dutch"
},
"dutch_override": {
"type": "stemmer_override",
"rules": [
"fiets=>fiets",
"bromfiets=>bromfiets",
"ei=>eier",
"kind=>kinder"
]
}
},
"analyzer": {
"rebuilt_dutch": {
"tokenizer": "standard",
"filter": [
"lowercase",
"dutch_stop",
"dutch_keywords",
"dutch_override",
"dutch_stemmer"
]
}
}
}
}
}
english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
estonian analyzer
The estonian analyzer could be reimplemented as a custom analyzer as follows:
PUT /estonian_example
{
"settings": {
"analysis": {
"filter": {
"estonian_stop": {
"type": "stop",
"stopwords": "_estonian_"
},
"estonian_keywords": {
"type": "keyword_marker",
"keywords": ["näide"]
},
"estonian_stemmer": {
"type": "stemmer",
"language": "estonian"
}
},
"analyzer": {
"rebuilt_estonian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"estonian_stop",
"estonian_keywords",
"estonian_stemmer"
]
}
}
}
}
}
finnish analyzer
The finnish analyzer could be reimplemented as a custom analyzer as follows:
PUT /finnish_example
{
"settings": {
"analysis": {
"filter": {
"finnish_stop": {
"type": "stop",
"stopwords": "_finnish_"
},
"finnish_keywords": {
"type": "keyword_marker",
"keywords": ["esimerkki"]
},
"finnish_stemmer": {
"type": "stemmer",
"language": "finnish"
}
},
"analyzer": {
"rebuilt_finnish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"finnish_stop",
"finnish_keywords",
"finnish_stemmer"
]
}
}
}
}
}
french analyzer
The french analyzer could be reimplemented as a custom analyzer as follows:
PUT /french_example
{
"settings": {
"analysis": {
"filter": {
"french_elision": {
"type": "elision",
"articles_case": true,
"articles": [
"l", "m", "t", "qu", "n", "s",
"j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
]
},
"french_stop": {
"type": "stop",
"stopwords": "_french_"
},
"french_keywords": {
"type": "keyword_marker",
"keywords": ["Example"]
},
"french_stemmer": {
"type": "stemmer",
"language": "light_french"
}
},
"analyzer": {
"rebuilt_french": {
"tokenizer": "standard",
"filter": [
"french_elision",
"lowercase",
"french_stop",
"french_keywords",
"french_stemmer"
]
}
}
}
}
}
galician analyzer
The galician analyzer could be reimplemented as a custom analyzer as follows:
PUT /galician_example
{
"settings": {
"analysis": {
"filter": {
"galician_stop": {
"type": "stop",
"stopwords": "_galician_"
},
"galician_keywords": {
"type": "keyword_marker",
"keywords": ["exemplo"]
},
"galician_stemmer": {
"type": "stemmer",
"language": "galician"
}
},
"analyzer": {
"rebuilt_galician": {
"tokenizer": "standard",
"filter": [
"lowercase",
"galician_stop",
"galician_keywords",
"galician_stemmer"
]
}
}
}
}
}
german analyzer
The german analyzer could be reimplemented as a custom analyzer as follows:
PUT /german_example
{
"settings": {
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_keywords": {
"type": "keyword_marker",
"keywords": ["Beispiel"]
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"rebuilt_german": {
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_keywords",
"german_normalization",
"german_stemmer"
]
}
}
}
}
}
greek analyzer
The greek analyzer could be reimplemented as a custom analyzer as follows:
PUT /greek_example
{
"settings": {
"analysis": {
"filter": {
"greek_stop": {
"type": "stop",
"stopwords": "_greek_"
},
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
},
"greek_keywords": {
"type": "keyword_marker",
"keywords": ["παράδειγμα"]
},
"greek_stemmer": {
"type": "stemmer",
"language": "greek"
}
},
"analyzer": {
"rebuilt_greek": {
"tokenizer": "standard",
"filter": [
"greek_lowercase",
"greek_stop",
"greek_keywords",
"greek_stemmer"
]
}
}
}
}
}
hindi analyzer
The hindi analyzer could be reimplemented as a custom analyzer as follows:
PUT /hindi_example
{
"settings": {
"analysis": {
"filter": {
"hindi_stop": {
"type": "stop",
"stopwords": "_hindi_"
},
"hindi_keywords": {
"type": "keyword_marker",
"keywords": ["उदाहरण"]
},
"hindi_stemmer": {
"type": "stemmer",
"language": "hindi"
}
},
"analyzer": {
"rebuilt_hindi": {
"tokenizer": "standard",
"filter": [
"lowercase",
"decimal_digit",
"hindi_keywords",
"indic_normalization",
"hindi_normalization",
"hindi_stop",
"hindi_stemmer"
]
}
}
}
}
}
hungarian analyzer
The hungarian analyzer could be reimplemented as a custom analyzer as follows:
PUT /hungarian_example
{
"settings": {
"analysis": {
"filter": {
"hungarian_stop": {
"type": "stop",
"stopwords": "_hungarian_"
},
"hungarian_keywords": {
"type": "keyword_marker",
"keywords": ["példa"]
},
"hungarian_stemmer": {
"type": "stemmer",
"language": "hungarian"
}
},
"analyzer": {
"rebuilt_hungarian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"hungarian_stop",
"hungarian_keywords",
"hungarian_stemmer"
]
}
}
}
}
}
indonesian analyzer
The indonesian analyzer could be reimplemented as a custom analyzer as follows:
PUT /indonesian_example
{
"settings": {
"analysis": {
"filter": {
"indonesian_stop": {
"type": "stop",
"stopwords": "_indonesian_"
},
"indonesian_keywords": {
"type": "keyword_marker",
"keywords": ["contoh"]
},
"indonesian_stemmer": {
"type": "stemmer",
"language": "indonesian"
}
},
"analyzer": {
"rebuilt_indonesian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"indonesian_stop",
"indonesian_keywords",
"indonesian_stemmer"
]
}
}
}
}
}
irish analyzer
The irish analyzer could be reimplemented as a custom analyzer as follows:
PUT /irish_example
{
"settings": {
"analysis": {
"filter": {
"irish_hyphenation": {
"type": "stop",
"stopwords": [ "h", "n", "t" ],
"ignore_case": true
},
"irish_elision": {
"type": "elision",
"articles": [ "d", "m", "b" ],
"articles_case": true
},
"irish_stop": {
"type": "stop",
"stopwords": "_irish_"
},
"irish_lowercase": {
"type": "lowercase",
"language": "irish"
},
"irish_keywords": {
"type": "keyword_marker",
"keywords": ["sampla"]
},
"irish_stemmer": {
"type": "stemmer",
"language": "irish"
}
},
"analyzer": {
"rebuilt_irish": {
"tokenizer": "standard",
"filter": [
"irish_hyphenation",
"irish_elision",
"irish_lowercase",
"irish_stop",
"irish_keywords",
"irish_stemmer"
]
}
}
}
}
}
italian analyzer
The italian analyzer could be reimplemented as a custom analyzer as follows:
PUT /italian_example
{
"settings": {
"analysis": {
"filter": {
"italian_elision": {
"type": "elision",
"articles": [
"c", "l", "all", "dall", "dell",
"nell", "sull", "coll", "pell",
"gl", "agl", "dagl", "degl", "negl",
"sugl", "un", "m", "t", "s", "v", "d"
],
"articles_case": true
},
"italian_stop": {
"type": "stop",
"stopwords": "_italian_"
},
"italian_keywords": {
"type": "keyword_marker",
"keywords": ["esempio"]
},
"italian_stemmer": {
"type": "stemmer",
"language": "light_italian"
}
},
"analyzer": {
"rebuilt_italian": {
"tokenizer": "standard",
"filter": [
"italian_elision",
"lowercase",
"italian_stop",
"italian_keywords",
"italian_stemmer"
]
}
}
}
}
}
latvian analyzer
The latvian analyzer could be reimplemented as a custom analyzer as follows:
PUT /latvian_example
{
"settings": {
"analysis": {
"filter": {
"latvian_stop": {
"type": "stop",
"stopwords": "_latvian_"
},
"latvian_keywords": {
"type": "keyword_marker",
"keywords": ["piemērs"]
},
"latvian_stemmer": {
"type": "stemmer",
"language": "latvian"
}
},
"analyzer": {
"rebuilt_latvian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"latvian_stop",
"latvian_keywords",
"latvian_stemmer"
]
}
}
}
}
}
lithuanian analyzer
The lithuanian analyzer could be reimplemented as a custom analyzer as follows:
PUT /lithuanian_example
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_keywords": {
"type": "keyword_marker",
"keywords": ["pavyzdys"]
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"rebuilt_lithuanian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
norwegian analyzer
The norwegian analyzer could be reimplemented as a custom analyzer as follows:
PUT /norwegian_example
{
"settings": {
"analysis": {
"filter": {
"norwegian_stop": {
"type": "stop",
"stopwords": "_norwegian_"
},
"norwegian_keywords": {
"type": "keyword_marker",
"keywords": ["eksempel"]
},
"norwegian_stemmer": {
"type": "stemmer",
"language": "norwegian"
}
},
"analyzer": {
"rebuilt_norwegian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"norwegian_stop",
"norwegian_keywords",
"norwegian_stemmer"
]
}
}
}
}
}
persian analyzer
The persian analyzer could be reimplemented as a custom analyzer as follows:
PUT /persian_example
{
"settings": {
"analysis": {
"char_filter": {
"zero_width_spaces": {
"type": "mapping",
"mappings": [ "\\u200C=>\\u0020"]
}
},
"filter": {
"persian_stop": {
"type": "stop",
"stopwords": "_persian_"
}
},
"analyzer": {
"rebuilt_persian": {
"tokenizer": "standard",
"char_filter": [ "zero_width_spaces" ],
"filter": [
"lowercase",
"decimal_digit",
"arabic_normalization",
"persian_normalization",
"persian_stop"
]
}
}
}
}
}
portuguese analyzer
The portuguese analyzer could be reimplemented as a custom analyzer as follows:
PUT /portuguese_example
{
"settings": {
"analysis": {
"filter": {
"portuguese_stop": {
"type": "stop",
"stopwords": "_portuguese_"
},
"portuguese_keywords": {
"type": "keyword_marker",
"keywords": ["exemplo"]
},
"portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
},
"analyzer": {
"rebuilt_portuguese": {
"tokenizer": "standard",
"filter": [
"lowercase",
"portuguese_stop",
"portuguese_keywords",
"portuguese_stemmer"
]
}
}
}
}
}
romanian analyzer
The romanian analyzer could be reimplemented as a custom analyzer as follows:
PUT /romanian_example
{
"settings": {
"analysis": {
"filter": {
"romanian_stop": {
"type": "stop",
"stopwords": "_romanian_"
},
"romanian_keywords": {
"type": "keyword_marker",
"keywords": ["exemplu"]
},
"romanian_stemmer": {
"type": "stemmer",
"language": "romanian"
}
},
"analyzer": {
"rebuilt_romanian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"romanian_stop",
"romanian_keywords",
"romanian_stemmer"
]
}
}
}
}
}
russian analyzer
The russian analyzer could be reimplemented as a custom analyzer as follows:
PUT /russian_example
{
"settings": {
"analysis": {
"filter": {
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"rebuilt_russian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
}
}
}
}
}
serbian analyzer
The serbian analyzer could be reimplemented as a custom analyzer as follows:
PUT /serbian_example
{
"settings": {
"analysis": {
"filter": {
"serbian_stop": {
"type": "stop",
"stopwords": "_serbian_"
},
"serbian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"serbian_stemmer": {
"type": "stemmer",
"language": "serbian"
}
},
"analyzer": {
"rebuilt_serbian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"serbian_stop",
"serbian_keywords",
"serbian_stemmer",
"serbian_normalization"
]
}
}
}
}
}
sorani analyzer
The sorani analyzer could be reimplemented as a custom analyzer as follows:
PUT /sorani_example
{
"settings": {
"analysis": {
"filter": {
"sorani_stop": {
"type": "stop",
"stopwords": "_sorani_"
},
"sorani_keywords": {
"type": "keyword_marker",
"keywords": ["mînak"]
},
"sorani_stemmer": {
"type": "stemmer",
"language": "sorani"
}
},
"analyzer": {
"rebuilt_sorani": {
"tokenizer": "standard",
"filter": [
"sorani_normalization",
"lowercase",
"decimal_digit",
"sorani_stop",
"sorani_keywords",
"sorani_stemmer"
]
}
}
}
}
}
spanish analyzer
The spanish analyzer could be reimplemented as a custom analyzer as follows:
PUT /spanish_example
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"rebuilt_spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
}
}
swedish analyzer
The swedish analyzer could be reimplemented as a custom analyzer as follows:
PUT /swedish_example
{
"settings": {
"analysis": {
"filter": {
"swedish_stop": {
"type": "stop",
"stopwords": "_swedish_"
},
"swedish_keywords": {
"type": "keyword_marker",
"keywords": ["exempel"]
},
"swedish_stemmer": {
"type": "stemmer",
"language": "swedish"
}
},
"analyzer": {
"rebuilt_swedish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"swedish_stop",
"swedish_keywords",
"swedish_stemmer"
]
}
}
}
}
}
turkish analyzer
The turkish analyzer could be reimplemented as a custom analyzer as follows:
PUT /turkish_example
{
"settings": {
"analysis": {
"filter": {
"turkish_stop": {
"type": "stop",
"stopwords": "_turkish_"
},
"turkish_lowercase": {
"type": "lowercase",
"language": "turkish"
},
"turkish_keywords": {
"type": "keyword_marker",
"keywords": ["örnek"]
},
"turkish_stemmer": {
"type": "stemmer",
"language": "turkish"
}
},
"analyzer": {
"rebuilt_turkish": {
"tokenizer": "standard",
"filter": [
"apostrophe",
"turkish_lowercase",
"turkish_stop",
"turkish_keywords",
"turkish_stemmer"
]
}
}
}
}
}
thai analyzer
The thai analyzer could be reimplemented as a custom analyzer as follows:
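A rough sketch of such a rebuilt thai analyzer, assuming the built-in thai tokenizer and the _thai_ stop word list (the index name thai_example and the filter name thai_stop are illustrative):
PUT /thai_example
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        }
      },
      "analyzer": {
        "rebuilt_thai": {
          "tokenizer": "thai",
          "filter": [
            "lowercase",
            "decimal_digit",
            "thai_stop"
          ]
        }
      }
    }
  }
}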
Pattern analyzer
The pattern analyzer uses a regular expression to split the text into terms.
The regular expression should match the token separators, not the tokens themselves. The regular expression defaults to \W+ (or all non-word characters).
Beware of pathological regular expressions
The pattern analyzer uses Java Regular Expressions.
A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
Example output
POST _analyze
{
"analyzer": "pattern",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Configuration
The pattern analyzer accepts the following parameters:
- pattern: A Java regular expression. Defaults to \W+.
- flags: Java regular expression flags. Flags should be pipe-separated, for example "CASE_INSENSITIVE|COMMENTS".
- lowercase: Should terms be lowercased or not. Defaults to true.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lower-case the result:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "pattern",
"pattern": "\\W|_",
"lowercase": true
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_email_analyzer",
"text": "John_Smith@foo-bar.com"
}
The above example produces the following terms:
[ john, smith, foo, bar, com ]
CamelCase tokenizer
The following more complicated example splits CamelCase text into tokens:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
GET my-index-000001/_analyze
{
"analyzer": "camel",
"text": "MooseX::FTPClass2_beta"
}
The above example produces the following terms:
[ moose, x, ftp, class, 2, beta ]
The regex above is easier to understand as:
([^\p{L}\d]+) # swallow non letters and numbers,
| (?<=\D)(?=\d) # or non-number followed by number,
| (?<=\d)(?=\D) # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
(?=\p{Lu}) # followed by upper case,
| (?<=\p{Lu}) # or upper case
(?=\p{Lu} # followed by upper case
[\p{L}&&[^\p{Lu}]] # then lower case
)
Definition
The pattern analyzer consists of:
- Tokenizer: Pattern Tokenizer
- Token Filters: Lower Case Token Filter, Stop Token Filter (disabled by default)
If you need to customize the pattern analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in pattern analyzer and you can use it as a starting point for further customization:
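A minimal sketch of such a rebuilt pattern analyzer, assuming a pattern tokenizer equivalent to the default \W+ behaviour (the index name pattern_example and the tokenizer name split_on_non_word are illustrative):
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}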
Simple analyzer
The simple analyzer breaks text into tokens at any non-letter character, such as numbers, spaces, hyphens and apostrophes, discards non-letter characters, and changes uppercase to lowercase.
Example
POST _analyze
{
"analyzer": "simple",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The simple analyzer parses the sentence and produces the following tokens:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Customize
To customize the simple analyzer, duplicate it to create the basis for a custom analyzer. This custom analyzer can be modified as required, usually by adding token filters.
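A rough sketch of such a duplicate, assuming the lowercase tokenizer, which is what the built-in simple analyzer is based on (the index and analyzer names are illustrative):
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_simple_analyzer": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}
Token filters you want to add would go into the currently empty filter list.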
Standard analyzer
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Example output
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Configuration
The standard analyzer accepts the following parameters:
- max_token_length: The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes), and to use the pre-defined list of English stop words:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_english_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_english_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
Definition
The standard analyzer consists of:
- Tokenizer: Standard Tokenizer
- Token Filters: Lower Case Token Filter, Stop Token Filter (disabled by default)
If you need to customize the standard analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in standard analyzer and you can use it as a starting point:
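A minimal sketch of such a rebuilt standard analyzer (the index name standard_example is illustrative):
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}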
Stop analyzer
The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop words.
Example output
POST _analyze
{
"analyzer": "stop",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Configuration
The stop analyzer accepts the following parameters:
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.
- stopwords_path: The path to a file containing stop words. This path is relative to the Elasticsearch config directory.
See the Stop Token Filter for more information about stop word configuration.
Example configuration
In this example, we configure the stop analyzer to use a specified list of words as stop words:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_stop_analyzer": {
"type": "stop",
"stopwords": ["the", "over"]
}
}
}
}
}
POST my-index-000001/_analyze
{
"analyzer": "my_stop_analyzer",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
Definition
The stop analyzer consists of:
- Tokenizer: Lower Case Tokenizer
- Token Filters: Stop Token Filter
If you need to customize the stop analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in stop analyzer and you can use it as a starting point for further customization:
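A minimal sketch of such a rebuilt stop analyzer, assuming the lowercase tokenizer plus a stop token filter configured with the _english_ stop word list (the index name stop_example and the filter name english_stop are illustrative):
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop"
          ]
        }
      }
    }
  }
}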
Whitespace analyzer
The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.
Example output
POST _analyze
{
"analyzer": "whitespace",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
Configuration
The whitespace analyzer is not configurable.
Definition
The whitespace analyzer consists of:
- Tokenizer: Whitespace Tokenizer
If you need to customize the whitespace analyzer then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in whitespace analyzer and you can use it as a starting point for further customization:
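A minimal sketch of such a rebuilt whitespace analyzer (the index name whitespace_example and the empty filter list are illustrative):
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": []
        }
      }
    }
  }
}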