I have an RDD of items and a function d: (Item, Item) => Double that computes the distance between two items. I am trying to compute the average distance between items drawn at random from the RDD. The RDD is fairly large (hundreds of millions of items), so computing the exact average over all pairs is out of the question.
Therefore, I would like to get an RDD of sampled pairs of items (from which I would compute the distances). For example, I want a sample of 100 million pairs. Given the RDD of sampled pairs, I would then compute the average, a histogram, etc. in order to understand the distance distribution.
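To make the goal concrete, the estimate I am after can be written as below. This is only a sketch: the symbols m (number of sampled pairs), (X_i, Y_i) (the i-th sampled pair) and s (the sample standard deviation of the observed distances) are notation introduced here, and the error term assumes the pairs are drawn independently and uniformly at random.

$$
\hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} d(X_i, Y_i),
\qquad
\operatorname{SE}(\hat{\mu}_m) \approx \frac{s}{\sqrt{m}}
$$

Under that assumption the error shrinks with the number of sampled pairs m and does not depend on the size of the RDD, which is why sampling pairs seems like a reasonable substitute for the exact average.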
Here are the initial attempts which have all failed:
- Generate two RDDs using .sample, zip them, and compute the distance between the items. This fails since .zip requires both RDDs to have the exact same number of items per partition.
- Use .cartesian of the RDD with itself, and then .sample. This fails (out of memory), since apparently cartesian is not meant to be used this way.
- Collect two small samples of the RDD and .zip the two arrays. This works fine, but it doesn't scale.
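For reference, the third attempt looks roughly like the sketch below. It is a minimal illustration, not my actual code: Item is aliased to Double, d is a placeholder absolute-difference distance, the data is a toy range, and the names SampledPairDistance and sampledDistances are made up for this example. It uses takeSample rather than .sample followed by collect, so that both local arrays have exactly the same length.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SampledPairDistance {
  // Placeholder stand-ins for the real Item type and distance function d.
  type Item = Double
  def d(a: Item, b: Item): Double = math.abs(a - b)

  // Draws two local samples of n items each (with replacement) and pairs them up.
  // Both arrays live on the driver, which is why this does not scale to 100 million pairs.
  def sampledDistances(items: RDD[Item], n: Int, seed: Long): Array[Double] = {
    val left  = items.takeSample(withReplacement = true, num = n, seed = seed)
    val right = items.takeSample(withReplacement = true, num = n, seed = seed + 1)
    left.zip(right).map { case (a, b) => d(a, b) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampled-pair-distance")
      .getOrCreate()

    // Toy data in place of the real RDD.
    val items: RDD[Item] = spark.sparkContext.parallelize((1 to 100000).map(_.toDouble))

    val distances = sampledDistances(items, n = 1000, seed = 1L)
    val mean = distances.sum / distances.length
    println(s"pairs = ${distances.length}, estimated mean distance = $mean")

    spark.stop()
  }
}
```

Since takeSample returns a plain Array to the driver, the number of pairs is limited by driver memory, which is the scaling problem mentioned above.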
Any ideas?
Thanks!