Friday, June 23, 2017

How to get random documents from Elasticsearch indexes with 50 million documents each

I'd like to sample 2000 random documents from approximately 60 ES indexes holding about 50 million documents each, for a total of about 3 billion documents overall. I've tried doing the following on the Kibana Dev Tools page:

GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "field_a": "some phrase"
        }
      },
      "random_score": {}
    }
  }
}

But this query never returns. When I refresh the Dev Tools page, I get a page telling me that the ES cluster status is red. This does not seem to be a coincidence, since it has happened on each of several attempts. Other queries without the random function (counts, simple match_all queries) work fine. I've read that function score queries tend to be slow, but a random function score is the only method I've been able to find for getting random documents from ES. Is there any other, faster way to sample random documents from multiple large ES indexes?
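One thing that might be worth trying, offered as a sketch and not as a confirmed fix, is to split the 2000-document sample into one small request per index instead of a single wildcard request across all of them. The Python below only builds the request bodies and sends nothing to the cluster. The index names, the helper names `split_sample` and `build_random_query`, and the even split are assumptions made for illustration. An even split approximates a uniform sample only if the indexes really do hold similar document counts.

```python
import json


def split_sample(total, index_names):
    """Divide `total` as evenly as possible across the given indexes."""
    base, extra = divmod(total, len(index_names))
    # The first `extra` indexes take one additional document each.
    return {
        name: base + (1 if i < extra else 0)
        for i, name in enumerate(index_names)
    }


def build_random_query(size, field, phrase):
    """Build the same function_score/random_score body as the query above."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_phrase": {field: phrase}},
                "random_score": {},
            }
        },
    }


# Hypothetical index names; the real ones match some_index_abc_*.
indexes = ["some_index_abc_%02d" % i for i in range(60)]

quotas = split_sample(2000, indexes)
bodies = {
    name: build_random_query(n, "field_a", "some phrase")
    for name, n in quotas.items()
}

# Show the body that would be sent to the first index.
print(json.dumps(bodies[indexes[0]], indent=2))
```

Each body could then be sent to `<index name>/_search` one index at a time, so that no single request has to score matches across all 3 billion documents. Whether this avoids the red cluster status is untested here.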



