I am trying to get a list of random results from an Elasticsearch query, with at most n documents sharing the same value for a keyword field (field2).
My approach has been to use a collapse query and collect the inner hits for each outer hit, setting the inner hits size to n.
This is the query (with n=5):
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "field1": "value1"
        }
      },
      "functions": [
        {
          "random_score": {}
        }
      ]
    }
  },
  "collapse": {
    "field": "field2",
    "inner_hits": {
      "name": "field2_hits",
      "size": 5
    }
  }
}
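For reference, the same request body can be built from code so that n is a parameter rather than a hard-coded 5. This is only a minimal sketch: the function name build_collapse_query and its arguments are made up for illustration, and it produces a plain dict without contacting any cluster.

```python
import json


def build_collapse_query(field1_value, n, collapse_field="field2"):
    """Build the search body from the question, with inner_hits size set to n.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    return {
        "query": {
            "function_score": {
                # Same match clause as in the question.
                "query": {"match": {"field1": field1_value}},
                # random_score with no seed gives a different order per request.
                "functions": [{"random_score": {}}],
            }
        },
        "collapse": {
            "field": collapse_field,
            "inner_hits": {"name": f"{collapse_field}_hits", "size": n},
        },
    }


body = build_collapse_query("value1", n=5)
print(json.dumps(body, indent=2))
```

The resulting dict could then be sent as the search request body by whichever client is in use.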
The results look as expected:
{
  ...
  hits=[
    {
      "_index": "my_index",
      "_type": "_doc",
      "_id": "7d0203f72780da89ce349b69985a070b73adad9de79e65c5a1cb316dd2c00504",
      "_score": 3.3316212,
      "_source": {...},
      "fields": {
        "field2": [ "value1" ]
      },
      "inner_hits": {
        "field2_hits": {
          "hits": {
            "total": {
              "value": 10000,
              "relation": "gte"
            },
            "max_score": 3.3304589,
            "hits": [...]
          }
...}
When collecting the inner hits, I noticed that the inner hits of different outer hits can be identical (same _id).
In a way, this seems logical, given that any number of outer hits with value X for field2 can be retrieved:
- hit A (outer hit, field2=X) with inner hits B, C, D.
- hit E (outer hit, field2=Y) with inner hits E, F, G.
- hit B (outer hit, field2=X) with inner hits A, C, D.
Hence, hits C and D are duplicated if I collect all the inner hits for each outer hit. For my use case, however, this is not what I want. I suppose I would need a single outer hit per field2 value instead, so that all inner hits are unique in the result set.
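To make the duplication concrete, here is a rough client-side sketch that walks the outer hits, skips inner hits whose _id was already collected, and stops taking documents for a field2 value once n of them have been kept. It is a workaround for illustration, not a server-side solution. It assumes the response has already been parsed into a dict shaped like the result above; the function name collect_unique_inner_hits, the helper _outer, the optional cap k, and the sample data are all hypothetical.

```python
from collections import defaultdict


def collect_unique_inner_hits(response, inner_name="field2_hits",
                              group_field="field2", n=5, k=None):
    """Return inner hits without duplicate _ids, at most n per group value.

    Hypothetical helper; `k` optionally caps the total number of documents.
    """
    seen_ids = set()
    per_group = defaultdict(int)
    result = []
    for outer in response["hits"]["hits"]:
        # The collapse key of the outer hit is reported under "fields".
        group = outer["fields"][group_field][0]
        inner = outer["inner_hits"][inner_name]["hits"]["hits"]
        for hit in inner:
            if k is not None and len(result) >= k:
                return result
            if hit["_id"] in seen_ids:
                continue  # already collected via another outer hit
            if per_group[group] >= n:
                break  # this group value has reached its limit
            seen_ids.add(hit["_id"])
            per_group[group] += 1
            result.append(hit)
    return result


def _outer(doc_id, value, inner_ids):
    """Build a fake outer hit mirroring the A/E/B example (illustrative data)."""
    return {
        "_id": doc_id,
        "fields": {"field2": [value]},
        "inner_hits": {"field2_hits": {"hits": {
            "hits": [{"_id": i} for i in inner_ids]}}},
    }


sample_response = {"hits": {"hits": [
    _outer("A", "X", ["B", "C", "D"]),
    _outer("E", "Y", ["E", "F", "G"]),
    _outer("B", "X", ["A", "C", "D"]),
]}}

unique_ids = [h["_id"] for h in collect_unique_inner_hits(sample_response, n=5)]
print(unique_ids)
```

This only hides the duplicates after the fact: the server still computes inner hits for every outer hit, so it does nothing about the load caused by a large n.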
An additional problem is that setting a large n puts a high load on the ES server, so this might be the wrong approach altogether. My question is thus:
How can I get a set of k random documents from an index, where at most n documents have the same value for a specific field?