Friday, April 9, 2021

Elasticsearch get max n results with identical field value

I am trying to get a list of random results from an Elasticsearch query, with at most n documents sharing the same value for a keyword field (field1).

My approach has been to use a collapse query and collect the inner hits for each outer hit, setting the inner hits size to n.

This is the query (with n=5):

{
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "field1": "value1"
                }
            },
            "functions": [
                {
                    "random_score": {}
                }
            ]
        }
    },
    "collapse": {
        "field": "field2",
        "inner_hits": {
            "name": "field2_hits",
            "size": 5
        }
    }
}

The results look as expected:

{
 ...
 hits=[
  {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "7d0203f72780da89ce349b69985a070b73adad9de79e65c5a1cb316dd2c00504",
                "_score": 3.3316212,
                "_source": {...},
                "fields": {
                    "field2": [ "value1" ]
                },
                "inner_hits": {
                    "field2_hits": {
                        "hits": {
                            "total": {
                                "value": 10000,
                                "relation": "gte"
                            },
                            "max_score": 3.3304589,
                            "hits": [...]
                        }
           ...}

When collecting the inner hits, I noticed that the inner hits for different outer hits can be identical (same id).

In a way, this seems logical, given that any number of outer hits with value X for field2 can be retrieved:

  • hit A (outer hit, field2=X) with inner hits B, C, D.
  • hit E (outer hit, field2=Y) with inner hits E, F, G.
  • hit B (outer hit, field2=X) with inner hits A, C, D.

Hence, hits C and D are duplicated if I collect all the inner hits for each outer hit. For my use case, however, this is not what I want. I suppose I would want only a single outer hit per field2 value instead, so that all inner hits are unique in the result set.
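
To illustrate the duplication, here is a minimal client-side sketch that merges the inner hits of all outer hits and keeps each _id only once. It assumes the response has already been parsed into a Python dict shaped like the output above; the function name unique_inner_hits is hypothetical, and field2_hits is the inner hits name from the query.

```python
from typing import Any, Dict, List


def unique_inner_hits(response: Dict[str, Any], name: str = "field2_hits") -> List[Dict[str, Any]]:
    """Flatten the inner hits of every outer hit, keeping each _id once.

    Assumption: `response` is the parsed search response as a plain dict.
    """
    seen = set()
    unique = []
    for outer in response.get("hits", {}).get("hits", []):
        inner = (
            outer.get("inner_hits", {})
            .get(name, {})
            .get("hits", {})
            .get("hits", [])
        )
        for hit in inner:
            # Skip documents already collected under another outer hit.
            if hit["_id"] not in seen:
                seen.add(hit["_id"])
                unique.append(hit)
    return unique
```

This only removes duplicates. In the example above it would return B, C, D and A, so the limit of n documents per field2 value is still not enforced.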

An additional problem is that setting a large n puts a high load on the ES server, so this might be the wrong approach altogether. My question is thus:

How can I get a set of k random documents from an index, where at most n documents have the same value for a specific field?
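
To make the requirement concrete, the selection I am after can be sketched on the client side as follows. This is only an illustration, not a solution to the server load problem: it assumes the hits arrive in random order (for example from the random_score query without the collapse part), and that field2 is stored as a single value in _source. The function name pick_capped is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def pick_capped(hits: Iterable[Dict[str, Any]], field: str, k: int, n: int) -> List[Dict[str, Any]]:
    """Return up to k hits, keeping at most n hits per value of `field`.

    Assumptions: `hits` is already randomly ordered, and each hit carries
    a scalar `field` value inside its `_source`.
    """
    per_value = Counter()
    picked = []
    for hit in hits:
        if len(picked) >= k:
            break
        value = hit["_source"][field]
        # Drop the hit once its field value has reached the cap n.
        if per_value[value] >= n:
            continue
        per_value[value] += 1
        picked.append(hit)
    return picked
```

With hits whose field2 values arrive in the order X, X, X, Y, X, Y, Z and k=4, n=2, this keeps the first two X hits and both Y hits. The drawback is that enough hits must be fetched for k of them to survive the cap.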



