Sunday, December 18, 2016

Apache Spark: sample RDD of pairs

I have an RDD of items, and a function d: (Item, Item) => Double that computes the distance between two items. I am trying to compute the average distance between items drawn at random from the RDD. The RDD is fairly large (hundreds of millions of items), so computing the exact average is out of the question.

Therefore I would like to get an RDD of sampled pairs of items (from which I would compute the distances). For example, I want to get a sample of 100 million pairs. Given the RDD of sampled pairs, I would then compute the average, histogram, etc. in order to understand the distance distribution.
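For context, once such an RDD of sampled distances exists, the summary statistics described above are one-liners via Spark's DoubleRDDFunctions. A minimal sketch — the `distances` values here are placeholder data, not real results:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "distance-stats")

// Stand-in for an RDD of sampled pairwise distances.
val distances = sc.parallelize(Seq(0.5, 1.0, 2.5, 3.0))

// Average distance across the sampled pairs.
val avg = distances.mean()

// Histogram with 4 evenly spaced buckets between min and max;
// returns the bucket edges and the count falling in each bucket.
val (bucketEdges, counts) = distances.histogram(4)
```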

Here are my initial attempts, all of which have failed:

  1. Generate two RDDs using .sample, zip them and compute the distance between items. This fails since .zip requires both RDDs to have the exact same number of items per partition.

  2. Use .cartesian of the RDD with itself, and then .sample. This fails (out of memory) since apparently cartesian is not meant to be used this way.

  3. Collect two small samples of the RDD, and .zip the two arrays. This works fine, but it doesn't scale.
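One possible way around the alignment failure in attempt 1 — a sketch only, using standard RDD operations; `items`, `d`, the `Int` element type, and `fraction` are stand-ins for the question's actual data — is to key each sampled element by its position via zipWithIndex and join the two samples on that index, which, unlike .zip, tolerates uneven sizes and partitioning:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "pair-sample")

// Stand-ins for the question's RDD of items and distance function d.
val items = sc.parallelize(1 to 100000)
def d(a: Int, b: Int): Double = math.abs(a - b).toDouble

// Draw two independent samples. Their sizes and partition layouts differ,
// so .zip would fail; instead key each element by its position and join
// on the index. Indices present in only one sample simply drop out.
val fraction = 0.01
val left  = items.sample(withReplacement = true, fraction).zipWithIndex().map(_.swap)
val right = items.sample(withReplacement = true, fraction).zipWithIndex().map(_.swap)

// RDD of distances between randomly paired items.
val distances = left.join(right).values.map { case (a, b) => d(a, b) }
val avgDistance = distances.mean()
```

The join keeps only indices present in both samples, so the number of pairs is roughly the size of the smaller sample — acceptable for a random-sampling estimate, since fraction can be tuned upward to hit a target pair count.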

Any ideas?

Thanks!



