Elasticsearch Tutorial (34): Chinese ik analyzer, pinyin, first letters, search

Combining the Chinese ik analyzer, pinyin, first-letter abbreviations, and search_as_you_type

1. Introduction
2. Chinese
- 2.1 The ik analyzer
  - 2.1.1 Analyzing with ik_smart
  - 2.1.2 Analyzing with ik_max_word
- 2.2 The standard analyzer
  - 2.2.1 Creating test data
  - 2.2.2 A simple match search
  - 2.2.3 match search with operator
- 2.3 The search_as_you_type field type
  - 2.3.1 Basic use of search_as_you_type
  - 2.3.2 search_as_you_type versus match with operator
3. pinyin tokenization

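1. Introduction

I recently set out to design a search box for stock names. It should support searching by Chinese characters, by full pinyin, by pinyin first letters, and by a mix of Chinese and pinyin.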

2. Chinese

Stock names are mostly 3 to 4 Chinese characters long. For tokenizing them I considered three options: ik, standard, and search_as_you_type.
Each is tested in turn below.

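2.1 The ik analyzer

For Chinese word segmentation, the well-known ik analyzer is the first thing that comes to mind. But given our special case of 3 to 4 character names, it is worth testing carefully whether ik is a good fit.

2.1.1 Analyzing with ik_smart

First, analyze with ik_smart: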

POST /_analyze
{
  "analyzer": "ik_smart", 
  "text": "建设银行"
}

Only a single token, "建设银行", comes back, which is not enough for our needs.

{
  "tokens" : [
    {
      "token" : "建设银行",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

2.1.2 Analyzing with ik_max_word

POST /_analyze
{
  "analyzer": "ik_max_word", 
  "text": "建设银行"
}

The result now also contains "建设" and "银行":

{
  "tokens" : [
    {
      "token" : "建设银行",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "建设",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "银行",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

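This still does not cover real usage. Users like to search for "建行", a common abbreviation of "建设银行" (China Construction Bank), and that query would find nothing.
ik does support custom dictionaries, but given this use case and the work involved, I decided against ik.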

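2.2 The standard analyzer

Elasticsearch ships with the standard analyzer by default. It is better suited to Western languages such as English and handles Chinese poorly: it simply splits the text into individual characters. Even so, I think it fits this particular use case.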

POST /_analyze
{
  "analyzer": "standard", 
  "text": "建设银行"
}

The result:

{
  "tokens" : [
    {
      "token" : "建",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "设",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "银",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "行",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

2.2.1 Creating test data

Create two documents. Both contain the word "银行", so after tokenization both contain the characters "银" and "行".

POST mytest/_doc/1
{
  "name": "建设银行"
}

POST mytest/_doc/2
{
  "name": "工商银行"
}

The inverted index then looks roughly like this:

Term	Document IDs
建	1
设	1
银	1, 2
行	1, 2
工	2
商	2
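
To make the term-to-document mapping concrete, here is a minimal Python sketch of such an inverted index. It is only a model, not how Elasticsearch stores data internally. The `tokenize` and `build_inverted_index` helpers are hypothetical names, and the one-token-per-character rule is an assumption based on the standard analyzer output shown earlier.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: one token per Chinese character, mirroring what the
    # standard analyzer returned for "建设银行" above.
    return list(text)


def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for term in tokenize(name):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "建设银行", 2: "工商银行"}
inverted_index = build_inverted_index(docs)

for term, ids in inverted_index.items():
    print(term, ids)
```

The shared characters "银" and "行" point at both documents, while the remaining characters each point at a single document.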

2.2.2 A simple match search

A match query for "建设银行" on the name field uses the four terms "建", "设", "银", and "行" to match documents.

GET /mytest/_search
{
  "query": {
    "match": {
      "name": "建设银行"
    }
  }
}

Both documents match. Document 1 matches all four terms and document 2 only two, so document 1 scores higher and ranks first.

"hits" : [
  {
    "_index" : "mytest",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.7509373,
    "_source" : {
      "name" : "建设银行"
    }
  },
  {
    "_index" : "mytest",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 0.36464313,
    "_source" : {
      "name" : "工商银行"
    }
  }
]

2.2.3 match search with operator

Setting "operator": "and" means a document must match all four terms "建", "设", "银", and "行".

GET /mytest/_search
{
  "query": {
    "match": {
      "name": {
        "query": "建设银行",
        "operator": "and"
      }
    }
  }
}

Only document 1 is returned:

"hits" : [
  {
    "_index" : "mytest",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.7509373,
    "_source" : {
      "name" : "建设银行"
    }
  }
]

With "operator": "and" and the query "建行", a document must match both terms "建" and "行".

GET /mytest/_search
{
  "query": {
    "match": {
      "name": {
        "query": "建行",
        "operator": "and"
      }
    }
  }
}

Again, only document 1 is returned:

"hits" : [
  {
    "_index" : "mytest",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.8754687,
    "_source" : {
      "name" : "建设银行"
    }
  }
]

This is roughly the behavior we want: typing the abbreviation "建行" finds "建设银行".
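
The difference between the default operator and "operator": "and" can be sketched in a few lines of Python. This is a simplified model under stated assumptions: tokens are single characters, and ranking simply counts matched terms as a stand-in for the real relevance score. The `match` function is a hypothetical helper, not an Elasticsearch API.

```python
def tokenize(text):
    # Assumption: one token per Chinese character, as with the standard analyzer.
    return list(text)


def match(docs, query, operator="or"):
    # Return matching document IDs, best match first.
    terms = set(tokenize(query))
    hits = []
    for doc_id, name in docs.items():
        matched = terms & set(tokenize(name))
        if operator == "and":
            is_hit = matched == terms   # every query term must be present
        else:
            is_hit = len(matched) > 0   # any query term is enough
        if is_hit:
            hits.append((doc_id, len(matched)))
    # Rank by number of matched terms; a crude substitute for real scoring.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return [doc_id for doc_id, _ in hits]


docs = {1: "建设银行", 2: "工商银行"}

print(match(docs, "建设银行"))          # both documents, document 1 first
print(match(docs, "建设银行", "and"))   # only document 1
print(match(docs, "建行", "and"))       # only document 1
```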

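2.3 The search_as_you_type field type

2.3.1 Basic use of search_as_you_type

The search_as_you_type field type is really convenient. The official Elasticsearch documentation on search_as_you_type is a good place to start.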

DELETE mytest

PUT mytest
{
    "mappings":{
        "properties":{
            "name":{
                "type":"search_as_you_type"
            }
        }
    }
}

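Elasticsearch automatically creates the following 4 fields:

Field	Description
name	Analyzed as configured in the mapping. If no analyzer is configured, the index's default analyzer is used
name._2gram	Wraps the analyzer of name with a shingle token filter of shingle size 2
name._3gram	Wraps the analyzer of name with a shingle token filter of shingle size 3
name._index_prefix	Wraps the analyzer of name._3gram with an edge ngram token filter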

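Below is the analysis of the phrase name = "建设银行". Within a shingle, the individual characters are joined by a space:

Field	Analyzed result
name	"建", "设", "银", "行"
name._2gram	"建 设", "设 银", "银 行"
name._3gram	"建 设 银", "设 银 行"
name._index_prefix	"建", "建 ", "建 设", "建 设 ", "建 设 银", and so on

name._index_prefix produces too many tokens to fit in the table; the request below lists them all: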

POST mytest/_analyze
{
  "field": "name._index_prefix",
  "text": ["建设银行"]
}
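
For intuition about where these subfield tokens come from, the following Python sketch imitates shingles and edge n-grams. It is a simplified model, not the actual filter implementation: it assumes shingles are joined by a single space, and it ignores any extra tokens the real _index_prefix analyzer may emit. `shingles` and `edge_ngrams` are hypothetical helper names.

```python
def shingles(tokens, size, separator=" "):
    # Join every run of `size` adjacent tokens into one shingle.
    return [
        separator.join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


def edge_ngrams(token, max_len=20):
    # All prefixes of a token, from length 1 up to max_len.
    return [token[:n] for n in range(1, min(len(token), max_len) + 1)]


tokens = list("建设银行")            # one token per character
two_grams = shingles(tokens, 2)
three_grams = shingles(tokens, 3)
prefixes = edge_ngrams(three_grams[0])

print(two_grams)
print(three_grams)
print(prefixes)
```

Because each shingle keeps its tokens in their original order, a query against these subfields only matches when the characters appear next to each other.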

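// Matches 建设银行 (documents 1 and 2 must be indexed again first, because the index was deleted and recreated above)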
GET /mytest/_search
{
  "query": {
    "match": {
      "name._2gram": {
        "query": "建设"
      }
    }
  }
}

// Matches 建设银行 and 工商银行
GET /mytest/_search
{
  "query": {
    "match": {
      "name._2gram": {
        "query": "银行"
      }
    }
  }
}

// Matches 建设银行
GET /mytest/_search
{
  "query": {
    "match": {
      "name._3gram": {
        "query": "建设银"
      }
    }
  }
}

2.3.2 search_as_you_type versus match with operator

Index two more documents:

POST mytest/_doc/3
{
  "name": "中油工程"
}

POST mytest/_doc/4
{
  "name": "中国石油"
}

// Matches both 中油工程 and 中国石油, with identical scores
GET /mytest/_search
{
  "query": {
    "match": {
      "name": {
        "query": "中油",
        "operator": "and"
      }
    }
  }
}
// Matches only 中油工程
GET /mytest/_search
{
  "query": {
    "match": {
      "name._2gram": {
        "query": "中油"
      }
    }
  }
}

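When querying the _2gram and _3gram subfields of a search_as_you_type field, the characters of the query must appear adjacent and in the same order in the document. This gives the stricter, order-preserving matching that a search box needs.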
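
The contrast between the two queries can be reproduced with a small Python model. It assumes single-character tokens for the and-match, and adjacent character pairs as a stand-in for the _2gram subfield. `and_match`, `bigrams`, and `two_gram_match` are hypothetical helpers written for this illustration.

```python
docs = {3: "中油工程", 4: "中国石油"}


def and_match(docs, query):
    # operator "and" on single characters: every character must occur
    # somewhere in the name, in any position.
    return [doc_id for doc_id, name in docs.items() if set(query) <= set(name)]


def bigrams(text):
    # Adjacent character pairs, a stand-in for the _2gram subfield.
    return {text[i:i + 2] for i in range(len(text) - 1)}


def two_gram_match(docs, query):
    # A document matches when it shares at least one bigram with the query.
    query_grams = bigrams(query)
    return [doc_id for doc_id, name in docs.items() if query_grams & bigrams(name)]


print(and_match(docs, "中油"))        # both documents contain 中 and 油
print(two_gram_match(docs, "中油"))   # only 中油工程 contains them side by side
```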

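The requirement here is that "建行" should match "建设银行" while "中油" should not match "中国石油". In my opinion synonyms can handle this: configure "建行" as a synonym of "建设银行". Conversely, as long as "中渔" is not configured as a synonym of "中水渔业", searching for "中渔" will not find "中水渔业".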
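
The synonym idea can be modeled as a query-time lookup followed by a strict, order-preserving match. The sketch below is only an illustration in Python, not an Elasticsearch configuration; in a real index one would more likely use a synonym token filter. The `SYNONYMS` table, the `NAMES` list, and the `search` function are all hypothetical.

```python
# Hypothetical synonym table: abbreviation -> full stock name.
SYNONYMS = {"建行": "建设银行"}

# Sample names taken from the examples in this article.
NAMES = ["建设银行", "工商银行", "中油工程", "中国石油", "中水渔业"]


def search(query):
    # Expand a known abbreviation, then require the (expanded) query to
    # appear as a contiguous run of characters in the name.
    expanded = SYNONYMS.get(query, query)
    return [name for name in NAMES if expanded in name]


print(search("建行"))   # found through the synonym
print(search("中油"))   # contiguous match only
print(search("中渔"))   # no synonym configured, so nothing is found
```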

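3. pinyin tokenization

ik and pinyin are by the same author; see the examples in the pinyin plugin's documentation.
I want to combine Chinese search with pinyin search, so I designed the mapping below. It may not be ideal, but let's try it first.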

DELETE mytest

PUT mytest
{
    "settings":{
        "analysis":{
            "analyzer":{
                "pinyin_analyzer":{
                    "tokenizer":"my_pinyin"
                }
            },
            "tokenizer":{
                "my_pinyin":{
                    "type":"pinyin",
                    "keep_first_letter":true,
                    "keep_separate_first_letter":true,
                    "keep_full_pinyin":true,
                    "keep_original":false,
                    "limit_first_letter_length":16,
                    "lowercase":true
                }
            },
            "filter":{
            }
        }
    },
    "mappings":{
        "properties":{
            "name":{
                "type":"text",
                "analyzer":"standard",
                "fields":{
                    "keyword":{
                        "type":"keyword"
                    },
                    "search":{
                        "type":"search_as_you_type"
                    },
                    "pinyin":{
                      "type": "text",
                      "analyzer": "pinyin_analyzer"
                    }
                }
            }
        }
    }
}

// Matches 中国石油
GET mytest/_search
{
  "query": {
      "match": {
        "name.pinyin": {
          "query": "zgsy",
          "operator": "and"
        }
      }
  }
}

// Matches 中国石油
GET mytest/_search
{
  "query": {
      "match": {
        "name.pinyin": {
          "query": "zgs",
          "operator": "and"
        }
      }
  }
}

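Of course, I think a real implementation should compute the first-letter abbreviation when the data is indexed and store it in its own field, with that field's type set to search_as_you_type.

This is only a first attempt, and many details still need work. One example is the polyphonic character "行": the pinyin tokenizer reads it as "xing" by default and so produces the first letter x, but in the word "银行" it is read "hang", so the first letter should be h.

This post is only a preliminary exploration. Real development will require many changes. Keep going!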
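
A minimal Python sketch of precomputing the first-letter abbreviation at indexing time, including a word-level override for polyphonic characters, might look like this. The tiny `CHAR_PINYIN` and `WORD_PINYIN` tables are hand-written assumptions that cover only the sample names; a real system would rely on a full pinyin dictionary. All function names here are hypothetical.

```python
# Hypothetical per-character readings, covering only the sample names.
CHAR_PINYIN = {
    "建": "jian", "设": "she", "银": "yin", "行": "xing",
    "工": "gong", "商": "shang", "中": "zhong", "国": "guo",
    "石": "shi", "油": "you",
}

# Hypothetical word-level overrides for polyphonic characters.
WORD_PINYIN = {"银行": ["yin", "hang"]}


def to_pinyin(name):
    # Prefer a word-level reading when one starts at the current position,
    # otherwise fall back to the per-character default.
    syllables = []
    i = 0
    while i < len(name):
        for word, reading in WORD_PINYIN.items():
            if name.startswith(word, i):
                syllables.extend(reading)
                i += len(word)
                break
        else:
            syllables.append(CHAR_PINYIN[name[i]])
            i += 1
    return syllables


def initials(name):
    # First letter of every syllable, e.g. the value to store at index time.
    return "".join(syllable[0] for syllable in to_pinyin(name))


def prefix_search(names, typed):
    # Search-as-you-type on the stored abbreviation: simple prefix match.
    return [name for name in names if initials(name).startswith(typed.lower())]


names = ["建设银行", "工商银行", "中国石油"]
print(initials("建设银行"))
print(prefix_search(names, "zgs"))
```

With the override in place, "建设银行" yields the expected abbreviation ending in h rather than x, and a partial abbreviation such as "zgs" already narrows the list to "中国石油".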
