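Note: this article is based on the articles and videos published by the WeChat official account 江南一点雨. If you need the data source, reply bookdata.json in that account's chat window to download the script. For data preparation, see the previous article. Links to his videos and articles are at the end of this article.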

Compound queries

constant_score query

When we do not care how term frequency (TF) affects the ranking of search results, we can wrap a query or filter clause in constant_score.

GET books/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "java"
        }
      },
      "boost": 1.5
    }
  }
}

Note: once boost is set, the _score of every hit in the result equals the boost value (1.5 here).

bool query

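bool query combines any number of simple queries. It offers four keywords, and each keyword can carry one or more conditions.

must: documents must match the queries under must.
should: documents may or may not match the queries under should.
must_not: documents must not match the queries under must_not.
filter: similar to must, but filter does not score; it only filters the data.

For example, find documents whose name must contain java, whose price is not in the range [0,35], and whose info may or may not contain 程序设计 (programming):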

GET books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "name": {
              "value": "java"
            }
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "price": {
              "gte": 0,
              "lte": 35
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "info": "程序设计"
          }
        }
      ]
    }
  }
}

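This brings up another parameter, minimum_should_match, which the es website describes as the minimum degree of matching. It can be set both in the multi_match query covered earlier and in the should clause used here.

Suppose we want to find documents whose name contains the keywords 语言程序设计 (language programming):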

GET books/_search
{
  "query": {
    "match": {
      "name": "语言程序设计"
    }
  }
}

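During this query the text is analyzed first, which yields four terms: 语言, 程序, 设计 and 程序设计. The terms are then assembled into a bool query of should clauses, each term becoming one term query. In other words, the query above is equivalent to the one below: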

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "name": {
              "value": "语言"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "程序设计"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "程序"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "设计"
            }
          }
        }
      ]
    }
  }
}

In both queries a document is returned as long as it contains any one of the terms. In a match query, the operator parameter can be set so that a document must match all terms.

GET books/_search
{
  "query": {
    "match": {
      "name": {
        "query": "语言程序设计",
        "operator": "and"
      }
    }
  }
}

To require only some of the terms to match, use the minimum_should_match parameter, which sets the minimum number of terms that must match.

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "name": {
              "value": "语言"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "程序设计"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "程序"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "设计"
            }
          }
        }
      ],
      "minimum_should_match": "50%"
    }
  },
  "from": 0,
  "size": 70
}

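**50% means 50% of the number of terms**, so at least 2 of the 4 terms must match here. The following two queries are equivalent (the value 4 is used because the query text is analyzed into 4 terms):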

GET books/_search
{
  "query": {
    "match": {
      "name": {
        "query": "语言程序设计",
        "minimum_should_match": 4
      }
    }
  }
}
GET books/_search
{
  "query": {
    "match": {
      "name": {
        "query": "语言程序设计",
        "operator": "and"
      }
    }
  }
}
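
How a minimum_should_match value turns into a required number of clauses can be sketched in a few lines. This is only an illustration of the integer and percentage forms (percentages are rounded down), not the es implementation; the function name required_matches is made up for this example.

```python
def required_matches(spec, optional_clauses):
    """Return how many optional clauses must match for a simple spec.

    Only plain integers ("3", "-1") and percentages ("50%", "-25%") are
    handled here; this is an illustrative sketch, not the es source code.
    """
    text = str(spec).strip()
    if text.endswith("%"):
        percent = int(text[:-1])
        # The share computed from a percentage is rounded down.
        portion = optional_clauses * abs(percent) // 100
        # A negative percentage is the share of clauses that may be missing.
        return portion if percent >= 0 else optional_clauses - portion
    count = int(text)
    # A negative integer is the number of clauses that may be missing.
    return count if count >= 0 else optional_clauses + count


# 语言程序设计 is analyzed into 4 terms in the example above.
print(required_matches("50%", 4))  # half of the 4 terms
print(required_matches(4, 4))      # all 4 terms, same effect as operator=and
```

With 4 terms, "50%" asks for 2 matching terms, while the integer 4 asks for all of them.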

dis_max query

Suppose we have the following two documents:

PUT blog
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "content":{
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

POST blog/_doc
{
  "title":"如何通过Java代码调用ElasticSearch",
  "content":"松哥力荐，这是一篇很好的解决方案"
}

POST blog/_doc
{
  "title":"初识 MongoDB",
  "content":"简单介绍一下 MongoDB，以及如何通过 Java 调用 MongoDB，MongoDB 是一个不错 NoSQL 解决方案"
}

Now suppose we search for the keywords Java解决方案 (Java solution) without knowing whether they appear in title or in content, so we search both fields:

GET blog/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "java解决方案"
          }
        },
        {
          "match": {
            "content": "java解决方案"
          }
        }
      ]
    }
  }
}

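Judging by eye, the second document looks more similar to the search keywords, yet the actual results do not rank it first. To understand why, look at the scoring strategy of a should query as described for the classic scoring model (recent versions drop the last two steps, the coordination factor, and simply sum the scores, which gives the same ranking here):

First, both queries in should are executed
The scores of the two queries are summed
The sum is multiplied by the number of matching clauses
The result of the third step is divided by the total number of clauses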

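Applied to our query:

The first document

title contains java; assume a score of 1.1
content contains 解决方案; assume a score of 1.2
The number of clauses that scored is 2
The total number of clauses is also 2, so the final result is (1.1+1.2)*2/2=2.3

The second document

title contains none of the keywords, so it gets no score
content contains both 解决方案 and java; assume a score of 2
The number of clauses that scored is 1
The total number of clauses is 2, so the final result is 2*1/2=1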
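
The arithmetic above can be replayed with a small sketch. It follows the four steps exactly as listed in this article and uses the assumed per-clause scores from the example; should_score is a made-up helper, not an es API.

```python
def should_score(clause_scores):
    """Score a should query from per-clause scores (None = clause did not match).

    Follows the steps described in the text: sum the matching scores,
    multiply by the number of matching clauses, divide by the total
    number of clauses.
    """
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


# Assumed scores from the example: [title clause, content clause].
first_document = should_score([1.1, 1.2])
second_document = should_score([None, 2.0])
print(first_document, second_document)
```

The first document wins even though its best single field (1.2) is weaker than the second document's content match (2.0).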

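In this kind of search, title and content compete with each other, so we need to find the single best-matching field. The dis_max query (disjunction max query) solves this: every matching document is still returned, but only the score of the best-matching clause is used as the score of the query.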

GET blog/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "java解决方案"
          }
        },
        {
          "match": {
            "content": "java解决方案"
          }
        }
        ]
    }
  }
}

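This time the document with the best single-field match ranks first. By default dis_max query ignores the scores of the other clauses completely and returns only the score of the best-matching field. Sometimes, however, the other clauses should still count for something; the tie_breaker parameter (a value from 0 to 1) handles this. The scores of the other matching clauses are multiplied by tie_breaker and then added to the highest score.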
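
A minimal sketch of this calculation, reusing the assumed clause scores from the should example; dis_max_score is a hypothetical helper written for illustration only.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching scores."""
    matched = [score for score in clause_scores if score is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Assumed scores: [title clause, content clause]; None = no match.
print(dis_max_score([1.1, 1.2]))                   # first document, best field only
print(dis_max_score([None, 2.0]))                  # second document
print(dis_max_score([1.1, 1.2], tie_breaker=0.5))  # other clause counts for half
```

With tie_breaker left at 0, the second document (2.0) now outranks the first (1.2).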

function_score query

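Scenario: suppose I search for nearby restaurants by name and want the restaurants with higher ratings shown first. The default scoring strategy only considers relevance and cannot take the rating into account; function_score query makes this possible.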

PUT book
{
  "mappings": {
    "properties": {
      "title":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "votes":{
        "type": "integer"
      }
    }
  }
}

PUT book/_doc/1
{
  "title":"Java并发编程",
  "votes":100
}

PUT book/_doc/2
{
  "title":"Java多线程详解，Java基础",
  "votes":10
}

GET book/_search
{
  "query": {
    "match": {
      "title": "java"
    }
  }
}

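By default the document with id 2 scores higher, because its title contains java twice. If we want the query to take the votes field into account and show documents with more votes first, we can use function_score.

The idea is to combine the old score with the value of votes and compute a new score. There are several ways to do the calculation: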

weight
random_score
script_score
field_value_factor

weight

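weight sets a weight on the score: the old score is multiplied by weight. On its own it does not solve the problem described above, because every document is scaled by the same factor. Usage: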

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [{"weight": 10}]
    }
  }
}

Running this shows that every score is now the previous score multiplied by 10.

random_score

random_score hashes the uid field of each document to generate a score. A seed can be configured; if none is given, the current time is used by default.

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {"random_score": {}}
      ]
    }
  }
}

script_score (important)

A custom scoring script. Suppose the score of each document should be the old score plus votes. The query looks like this:

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "_score + doc['votes'].value"
            }
          }
        }
      ]
    }
  }
}

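With this query the final score is actually (oldScore+votes)*oldScore, because by default the script result is multiplied by the old score. To avoid multiplying by oldScore, write the query like this: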

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "source": "_score + doc['votes'].value"
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}

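The boost_mode parameter controls how the function result is combined with the old score. Its possible values are:

multiply: multiply the two scores (the default)
sum: add the two scores
avg: take the average
max: take the larger score
min: take the smaller score
replace: use the function result as it is, without a second calculation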
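
The effect of each mode can be sketched as follows. The old score of 0.5 is an assumed value for illustration, and combine is a made-up helper rather than anything provided by es.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the old query score with the function result per boost_mode."""
    modes = {
        "multiply": lambda q, f: q * f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": lambda q, f: max(q, f),
        "min": lambda q, f: min(q, f),
        "replace": lambda q, f: f,
    }
    return modes[boost_mode](query_score, function_score)


old_score = 0.5                  # assumed relevance score
script_result = old_score + 100  # _score + doc['votes'].value with votes = 100
for mode in ("multiply", "sum", "avg", "max", "min", "replace"):
    print(mode, combine(old_score, script_result, mode))
```

Only replace yields the plain oldScore+votes value; the default multiply gives (oldScore+votes)*oldScore.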

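field_value_factor (important)

This works like script_score but needs no script. Suppose the final score of each document should be the old score multiplied by votes. The query looks like this: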

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "votes"
          }
        }
      ]
    }
  }
}

By default the resulting score is oldScore*votes. The built-in es functions allow more complex calculations:

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "votes",
            "modifier": "sqrt"
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}

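Now the final score is sqrt(votes). The modifier parameter selects a built-in function; the available values are:

Modifier	Meaning
none	the default; no calculation is applied
log	common (base-10) logarithm of the field value
log1p	add 1 to the field value, then take the common logarithm
log2p	add 2 to the field value, then take the common logarithm
ln	natural logarithm of the field value
ln1p	add 1 to the field value, then take the natural logarithm
ln2p	add 2 to the field value, then take the natural logarithm
sqrt	square root of the field value
square	square of the field value
reciprocal	reciprocal of the field value

There is also a factor parameter. The field value is first multiplied by factor and the modifier is applied afterwards. With sqrt, the calculation is sqrt(factor*votes):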

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "votes",
            "modifier": "sqrt",
            "factor": 10
          }
        }
      ],
      "boost_mode": "replace"
    }
  }
}
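
To make the factor and modifier calculation concrete, here is a small sketch that mirrors the modifier table. The names MODIFIERS and field_value_factor are made up for this example; this is a reading of the description above, not the es source code.

```python
import math

# One entry per modifier listed in the table; log means base-10 logarithm.
MODIFIERS = {
    "none": lambda x: x,
    "log": lambda x: math.log10(x),
    "log1p": lambda x: math.log10(x + 1),
    "log2p": lambda x: math.log10(x + 2),
    "ln": lambda x: math.log(x),
    "ln1p": lambda x: math.log(x + 1),
    "ln2p": lambda x: math.log(x + 2),
    "sqrt": lambda x: math.sqrt(x),
    "square": lambda x: x * x,
    "reciprocal": lambda x: 1 / x,
}


def field_value_factor(value, factor=1.0, modifier="none"):
    """Multiply the field value by factor, then apply the modifier."""
    return MODIFIERS[modifier](factor * value)


# The two documents in the book index have votes 100 and 10.
for votes in (100, 10):
    print(votes, field_value_factor(votes, factor=10, modifier="sqrt"))
```

With factor 10 the document with 10 votes gets sqrt(100), which is exactly 10.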

There is one more parameter, max_boost, which limits the range of the computed result:

GET book/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {"title": "java"}
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "votes"
          }
        }
      ],
      "boost_mode": "sum",
      "max_boost": 100
    }
  }
}

max_boost is the upper limit on the final result computed by the functions block. If the result exceeds the limit, the limit is used instead.
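
A rough sketch of the capping step, assuming the cap is applied to the function result before it is combined with the old score. The old score 0.5 and the limit 50 are assumed values, and capped_sum is a hypothetical helper.

```python
def capped_sum(query_score, function_result, max_boost):
    """Cap the function result at max_boost, then add it (boost_mode = sum)."""
    return query_score + min(function_result, max_boost)


old_score = 0.5  # assumed relevance score
print(capped_sum(old_score, 100, max_boost=100))  # votes = 100, limit not exceeded
print(capped_sum(old_score, 100, max_boost=50))   # hypothetical lower limit caps votes
```

In the query above the limit of 100 does not change anything for a document with 100 votes; a lower limit such as 50 would.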

boosting query

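A boosting query has three parts:

positive: documents matching this query are returned, and their score stays unchanged
negative: documents that also match this query have their score lowered
negative_boost: the weight by which the score is lowered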

GET books/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "name": "java"
        }
      },
      "negative": {
        "match": {
          "name": "2008"
        }
      },
      "negative_boost": 0.5
    }
  }
}

After running this, documents whose name contains 2008 have their score multiplied by 0.5.
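
The rule is simple enough to write down directly. The positive score of 2.0 below is an assumed value, and boosting_score is a made-up helper.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Keep the positive score, scaled by negative_boost on a negative match."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


print(boosting_score(2.0, matches_negative=False, negative_boost=0.5))
print(boosting_score(2.0, matches_negative=True, negative_boost=0.5))
```

Unlike must_not, the negative clause does not remove documents; it only pushes them down the ranking.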

Nested queries

Nested documents

This section assumes some familiarity with the nested type; if it is new to you, read up on it first. Then run the following:

PUT movies
{
  "mappings": {
    "properties": {
      "actors":{
        "type": "nested"
      }
    }
  }
}

PUT movies/_doc/1
{
  "name":"霸王别姬",
  "actors":[
    {
      "name":"张国荣",
      "gender":"男"
    },
    {
      "name":"巩俐",
      "gender":"女"
    }
    ]
}

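After indexing, check the document count:

GET _cat/indices?v

Although we added only one document, docs.count is 3. This is because each nested document is stored internally as a separate lucene document; es joins them at query time, so the result looks like a single document. For the same reason, this approach does not perform particularly well.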

Nested query

This query searches nested documents:

GET movies/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "actors.name": "张国荣"
              }
            },
            {
              "match": {
                "actors.gender": "男"
              }
            }
          ]
        }
      }
    }
  }
}

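Parent-child documents

Compared with nested documents, parent-child documents have the following advantages:

Updating a parent document does not reindex its child documents.
Creating, modifying or deleting a child document does not affect the parent document or other child documents.
Child documents can be returned on their own as search results.

For example, the relationship between students and classes: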

PUT stu_class
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "s_c":{
        "type": "join",
        "relations":{
          "class":"student"
        }
      }
    }
  }
}

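s_c is the name of the parent-child relation field and can be chosen freely. The type join marks it as a parent-child relation. Inside relations, class is the parent name and student is the child name. Next, insert two parent documents: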

PUT stu_class/_doc/1
{
  "name":"一班",
  "s_c":{
    "name":"class"
  }
}
PUT stu_class/_doc/2
{
  "name":"二班",
  "s_c":{
    "name":"class"
  }
}

Then add three child documents:

PUT stu_class/_doc/3?routing=1
{
  "name":"zhangsan",
  "s_c":{
    "name":"student",
    "parent":1
  }
}
PUT stu_class/_doc/4?routing=1
{
  "name":"lisi",
  "s_c":{
    "name":"student",
    "parent":1
  }
}
PUT stu_class/_doc/5?routing=2
{
  "name":"wangwu",
  "s_c":{
    "name":"student",
    "parent":2
  }
}

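As you can see, child documents are independent documents. Note in particular that a child document must be on the same shard as its parent, so the value of routing is the id of the parent document. In addition, the name property inside s_c marks the document as a child, and parent gives the id of its parent.

Things to keep in mind for parent-child documents:

Each index can define only one join field
Parent and child documents must be on the same shard (queries and updates need routing)
New relations can be added to an existing join field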
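
Why the routing value keeps a child next to its parent can be sketched like this. The shard count of 5 is an assumption, and zlib.crc32 is only a stand-in for whatever hash es really uses; the point is that the same routing value always maps to the same shard.

```python
import zlib

NUMBER_OF_SHARDS = 5  # hypothetical shard count


def shard_for(routing, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a shard from the routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(str(routing).encode("utf-8")) % number_of_shards


# (document id, routing value): parents route by their own id,
# children use ?routing=<parent id> as in the requests above.
documents = [(1, 1), (2, 2), (3, 1), (4, 1), (5, 2)]
placement = {doc_id: shard_for(routing) for doc_id, routing in documents}
print(placement)
```

Documents 3 and 4 end up wherever document 1 is, and document 5 follows document 2.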

has_child query

Use has_child query to find parent documents through their child documents.

GET stu_class/_search
{
  "query": {
    "has_child": {
      "type": "student",
      "query": {
        "match": {
          "name": "wangwu"
        }
      }
    }
  }
}

This finds the class that wangwu belongs to.

has_parent query

Find child documents through their parent document:

GET stu_class/_search
{
  "query": {
    "has_parent": {
      "parent_type": "class",
      "query": {
        "match": {
          "name": "二班"
        }
      }
    }
  }
}

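This finds the students of class 二班. Note, however, that this query does not produce a relevance score.

Child documents can also be looked up by parent id: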

GET stu_class/_search
{
  "query": {
    "parent_id":{
      "type":"student",
      "id":1
    }
  }
}

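When querying by parent id, the score is computed by relevance by default.

Summary

Overall:

1. Plain sub-objects can model one-to-many relations, but the boundaries between sub-documents are lost, and with them the association between the properties of each sub-object.
2. nested solves the problem in point 1, but it has two drawbacks: updating the main document means updating everything, and a sub-document cannot belong to more than one main document.
3. Parent-child documents solve the problems in points 1 and 2, but they are mainly suited to scenarios with many writes and few reads.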
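
The lost boundary in point 1 is easy to see in a simplified model. The sketch below compares exact values only (no analysis), and flatten, object_match and nested_match are made-up helpers that imitate the two behaviours rather than reproduce es internals.

```python
actors = [
    {"name": "张国荣", "gender": "男"},
    {"name": "巩俐", "gender": "女"},
]


def flatten(objects):
    """Plain sub-objects: values of each property are pooled per field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat


def object_match(objects, **conditions):
    """Each condition is checked against the pooled values of its field."""
    flat = flatten(objects)
    return all(value in flat.get(key, []) for key, value in conditions.items())


def nested_match(objects, **conditions):
    """All conditions must hold inside one and the same sub-object."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )


print(flatten(actors))
print(object_match(actors, name="张国荣", gender="女"))  # boundary lost
print(nested_match(actors, name="张国荣", gender="女"))  # boundary kept
```

The flattened form wrongly matches 张国荣 together with 女, while the nested form only matches combinations that exist in a single actor.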

Geo queries

Data preparation

Create an index:

PUT geo
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "location":{
        "type": "geo_point"
      }
    }
  }
}

Prepare a geo.json file:

{"index":{"_index":"geo","_id":1}}
{"name":"西安","location":"34.288991865037524,108.9404296875"}
{"index":{"_index":"geo","_id":2}}
{"name":"北京","location":"39.926588421909436,116.43310546875"}
{"index":{"_index":"geo","_id":3}}
{"name":"上海","location":"31.240985378021307,121.53076171875"}
{"index":{"_index":"geo","_id":4}}
{"name":"天津","location":"39.13006024213511,117.20214843749999"}
{"index":{"_index":"geo","_id":5}}
{"name":"杭州","location":"30.259067203213018,120.21240234375001"}
{"index":{"_index":"geo","_id":6}}
{"name":"武汉","location":"30.581179257386985,114.3017578125"}
{"index":{"_index":"geo","_id":7}}
{"name":"合肥","location":"31.840232667909365,117.20214843749999"}
{"index":{"_index":"geo","_id":8}}
{"name":"重庆","location":"29.592565403314087,106.5673828125"}

Finally, run the following command to bulk import the geo.json data:

curl -XPOST "http://localhost:9200/geo/_bulk?pretty" -H "content-type:application/json" --data-binary @geo.json

geo_distance query

Given a center point, find the documents within a given distance of it:

GET geo/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "600km",
            "location": {
              "lat": 34.288991865037524,
              "lon": 108.9404296875
            }
          }
        }
      ]
    }
  }
}

This returns the data inside a circle centered on (34.288991865037524,108.9404296875) with a radius of 600 km.
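
As a rough cross-check of what such a filter selects, the distances can be approximated with the haversine formula on the sample data. This is a sketch under assumptions: a spherical earth with a mean radius of 6371 km, which is not necessarily the exact computation es performs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius

# (lat, lon) pairs copied from geo.json
CITIES = {
    "西安": (34.288991865037524, 108.9404296875),
    "北京": (39.926588421909436, 116.43310546875),
    "上海": (31.240985378021307, 121.53076171875),
    "天津": (39.13006024213511, 117.20214843749999),
    "杭州": (30.259067203213018, 120.21240234375001),
    "武汉": (30.581179257386985, 114.3017578125),
    "合肥": (31.840232667909365, 117.20214843749999),
    "重庆": (29.592565403314087, 106.5673828125),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within(center, radius_km):
    """Names of the cities no farther than radius_km from center."""
    return [name for name, point in CITIES.items()
            if haversine_km(center, point) <= radius_km]


print(within(CITIES["西安"], 600))
```

Under this approximation only 西安 itself and 重庆 fall inside the 600 km circle; 武汉 lies a little outside it.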

geo_bounding_box query

Find the points inside a rectangle, which is defined by two points:

GET geo/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "geo_bounding_box": {
            "location": {
              "top_left": {
                "lat": 32.0639555946604,
                "lon": 118.78967285156249
              },
              "bottom_right": {
                "lat": 29.98824461550903,
                "lon": 122.20642089843749
              }
            }
          }
        }
      ]
    }
  }
}

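With the coordinates of Nanjing as the top-left corner and those of Zhoushan as the bottom-right corner, the resulting rectangle contains two cities, Shanghai and Hangzhou.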
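
The rectangle test itself is a pair of range checks, which can be verified against the sample data. A minimal sketch, assuming the box does not cross the 180° meridian; in_bounding_box is a made-up helper.

```python
# (lat, lon) pairs copied from geo.json
CITIES = {
    "西安": (34.288991865037524, 108.9404296875),
    "北京": (39.926588421909436, 116.43310546875),
    "上海": (31.240985378021307, 121.53076171875),
    "天津": (39.13006024213511, 117.20214843749999),
    "杭州": (30.259067203213018, 120.21240234375001),
    "武汉": (30.581179257386985, 114.3017578125),
    "合肥": (31.840232667909365, 117.20214843749999),
    "重庆": (29.592565403314087, 106.5673828125),
}

TOP_LEFT = (32.0639555946604, 118.78967285156249)      # Nanjing (lat, lon)
BOTTOM_RIGHT = (29.98824461550903, 122.20642089843749)  # Zhoushan (lat, lon)


def in_bounding_box(point, top_left, bottom_right):
    """True if point lies between the two corners in both latitude and longitude."""
    lat, lon = point
    return (bottom_right[0] <= lat <= top_left[0]
            and top_left[1] <= lon <= bottom_right[1])


inside = [name for name, point in CITIES.items()
          if in_bounding_box(point, TOP_LEFT, BOTTOM_RIGHT)]
print(inside)
```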

geo_polygon query

Query within a polygon.

GET geo/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "geo_polygon": {
            "location": {
              "points": [
                {
                  "lat": 31.793755581217674,
                  "lon": 113.8238525390625
                },
                {
                  "lat": 30.007273923504556,
                  "lon":114.224853515625
                },
                {
                  "lat": 30.007273923504556,
                  "lon":114.8345947265625
                }
              ]
            }
          }
        }
      ]
    }
  }
}

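Given several points, this returns the data inside the polygon they form.

geo_shape query

geo_shape is used to query shapes. For geo_shape, the possible relations between two shapes are: intersecting, containing, and disjoint.

Create a new index: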

PUT geo_shape
{
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword"
      },
      "location":{
        "type": "geo_shape"
      }
    }
  }
}

Then add a line:

PUT geo_shape/_doc/1
{
  "name":"西安-郑州",
  "location":{
    "type":"linestring",
    "coordinates":[
      [108.9404296875,34.279914398549934],
      [113.66455078125,34.768691457552706]
      ]
  }
}

Next, query whether a given shape contains this line:

GET geo_shape/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "geo_shape": {
            "location": {
              "shape": {
                "type": "envelope",
                "coordinates": [
                  [
                    106.5234375,
                    36.80928470205937
                  ],
                  [
                    115.33447265625,
                    32.24997445586331
                  ]
                ]
              },
              "relation": "within"
            }
          }
        }
      ]
    }
  }
}

The relation property describes the relation between the two shapes:

within: contained
intersects: intersecting
disjoint: not intersecting
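
For the within case of this example, the check reduces to testing both endpoints of the line against the envelope, since a straight segment whose endpoints are inside a rectangle stays inside it. The sketch below covers only that case, treats coordinates as a flat plane, and uses made-up helper names.

```python
# Coordinates are [lon, lat], as in the documents above.
LINE = [
    [108.9404296875, 34.279914398549934],
    [113.66455078125, 34.768691457552706],
]
ENVELOPE = [
    [106.5234375, 36.80928470205937],    # top-left corner
    [115.33447265625, 32.24997445586331],  # bottom-right corner
]


def point_in_envelope(point, envelope):
    """True if a [lon, lat] point lies inside the envelope."""
    (left, top), (right, bottom) = envelope
    lon, lat = point
    return left <= lon <= right and bottom <= lat <= top


def line_within_envelope(line, envelope):
    """A line is within the envelope when every vertex is inside it."""
    return all(point_in_envelope(point, envelope) for point in line)


print(line_within_envelope(LINE, ENVELOPE))
```

Shrinking the envelope so that its left edge is at longitude 110 would leave the first vertex outside, and the line would no longer be within it.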

Specialized queries

more_like_this query

more_like_this query enables content-based recommendation: given a piece of text, it finds content similar to it.

GET books/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "info"
      ],
      "like": "大学战略",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

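fields: the fields to match; more than one can be given
like: the text to match
min_term_freq: the minimum term frequency, 2 by default. Note that this is the frequency of the term in the text to match, not in the es documents
max_query_terms: the maximum number of terms included in the query
min_doc_freq: the minimum document frequency; a term must appear in at least this many documents, otherwise it is ignored
max_doc_freq: the maximum document frequency
analyzer: the analyzer; the field's own analyzer is used by default
stop_words: the list of stop words
minimum_should_match: the minimum number of terms that must match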

script query

Script query. For example, find all books with a price greater than 200:

GET books/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "lang": "painless",
              "source": "doc['price'].size() != 0 && doc['price'].value > 200"
            }
          }
        }
      ]
    }
  }
}

percolate query

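percolate query is also known as a reverse query.

Normal search: find the matching documents for a query, query->document
percolate query: given documents, return the stored queries that match them, document->query

Use cases:

Price monitoring
Inventory alerts
Stock alerts
…

For example, threshold alerting: raise an alert when a given field value exceeds the threshold.

Defining the percolate mapping: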

PUT log
{
  "mappings": {
    "properties": {
      "threshold":{
        "type": "long"
      },
      "count":{
        "type": "long"
      },
      "query":{
        "type":"percolator"
      }
    }
  }
}

percolator is a field type, on a par with keyword, long, integer and so on.

Insert a document:

PUT log/_doc/1
{
  "threshold":10,
  "query":{
    "bool":{
      "must":{
        "range":{
          "count":{
            "gt":10
          }
        }
      }
    }
  }
}

Finally, run the query:

GET log/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "count":3
        },
        {
          "count":6
        },
        {
          "count":90
        },
        {
          "count":12
        },
        {
          "count":15
        }
        ]
    }
  }
}

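The result returns the stored query together with the supplied documents that match it, here the documents whose count exceeds the threshold of 10.

The _percolator_document_slot field in the result gives the positions of those documents, counting from 0.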
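
The document->query direction can be imitated in a few lines to see which slots to expect. This is only a simulation of the idea with a hand-written predicate standing in for the stored range query; stored_queries and percolate are made-up names, and the real response contains more fields.

```python
# One stored query per id; the predicate mirrors {"range": {"count": {"gt": 10}}}.
stored_queries = {
    "1": lambda doc: doc.get("count", 0) > 10,
}

documents = [{"count": 3}, {"count": 6}, {"count": 90}, {"count": 12}, {"count": 15}]


def percolate(docs):
    """Return each stored query that matches, with the slots of the matching docs."""
    hits = []
    for query_id, predicate in stored_queries.items():
        slots = [slot for slot, doc in enumerate(docs) if predicate(doc)]
        if slots:
            hits.append({"_id": query_id, "_percolator_document_slot": slots})
    return hits


print(percolate(documents))
```

Counting from 0, the documents with count 90, 12 and 15 sit in slots 2, 3 and 4.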

References

Reference 1

Reference 2