Suppose we are to select a random subset of size m from a total set of size n. Since each element of the total set can be identified by a unique index from S = {0, 1, 2, ..., (n - 1)}, the problem is equivalent to randomly selecting m distinct elements from S.
A trivial algorithm would repeatedly invoke a pseudo-random number generator rand to generate random numbers from S. If a number has been generated before, just try again. The algorithm terminates once m distinct numbers have been generated. This algorithm has an optimal space complexity of O(1) (not counting the output), but may invoke rand more than m times.
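The trivial approach described above can be sketched as follows. This is only an illustration under stated assumptions: the name rejection_subset, the optional calls counter, and the use of std::uniform_int_distribution in place of a raw rand are not from the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draws until m distinct indices from {0, ..., n - 1}
// have been collected. `calls`, if non-null, receives the number of draws.
template <typename Gen>
std::vector<std::size_t> rejection_subset(std::size_t m, std::size_t n,
                                          Gen& gen,
                                          std::size_t* calls = nullptr) {
    assert(m <= n);
    std::vector<std::size_t> picked;
    picked.reserve(m);
    std::size_t draws = 0;
    if (m > 0) {
        std::uniform_int_distribution<std::size_t> dist(0, n - 1);
        while (picked.size() < m) {
            const std::size_t x = dist(gen);
            ++draws;
            // Linear scan of the output: no bookkeeping beyond the result.
            if (std::find(picked.begin(), picked.end(), x) == picked.end()) {
                picked.push_back(x);
            }
        }
    }
    if (calls != nullptr) *calls = draws;
    return picked;
}
```

With a std::mt19937 gen, rejection_subset(5, 8, gen, &calls) returns five distinct indices below 8, and calls ends up at 5 or more depending on how many duplicates were rejected. The duplicate check here is a linear scan, so the saved space is paid for in time as m grows.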
I'm more concerned with time complexity than with space complexity, and would happily trade space for time where reasonable, so I implemented the following algorithm. It invokes rand exactly min{m, (n - m)} times, at the price of an increased space complexity of O(n). (The original code can be found here.)
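#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns the current tick count of Clock; used here only as a seed.
template <typename Clock = std::chrono::high_resolution_clock>
auto tick_count() {
    return Clock::now().time_since_epoch().count();
}

// Writes m distinct indices drawn from {0, ..., n - 1} to `it`.
template <typename OutIt, typename RAND = std::minstd_rand,
          typename Uint = typename RAND::result_type>
void random_subset(std::size_t m, std::size_t n, OutIt it,
                   RAND&& rand = RAND(static_cast<Uint>(tick_count()))) {
    assert(m <= n);
    if (m == 0) return;  // also covers n == 0, so n - 1 below cannot wrap
    assert(n - 1 <= rand.max());
    // Work on whichever is smaller: the subset or its complement.
    auto swapped = false;
    auto tmp = n - m;
    if (tmp < m) {
        m = tmp;
        swapped = true;
    }
    std::vector<std::size_t> indices(n);
    std::iota(indices.begin(), indices.end(), static_cast<std::size_t>(0));
    // Partial Fisher-Yates shuffle: move m random picks to the back.
    auto back_it = indices.end();
    for (std::size_t i = 0; i < m; ++i) {
        auto idx = rand() % (n - i);  // note: the modulo is slightly biased
        std::swap(indices[idx], *--back_it);
    }
    // The picks sit in [back_it, end); if swapped, output the complement.
    if (swapped) {
        std::copy(indices.begin(), back_it, it);
    } else {
        std::copy(back_it, indices.end(), it);
    }
}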
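For comparison, here is a more compact sketch of the same partial-shuffle idea that returns the subset as a vector. It is an illustration rather than a replacement: the name partial_shuffle_subset is hypothetical, and std::uniform_int_distribution is swapped in for rand() % (n - i) to avoid modulo bias, which may cost more than one engine invocation per draw.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical variant of the partial shuffle: returns m distinct indices
// from {0, ..., n - 1} using exactly min(m, n - m) distribution draws.
template <typename Gen>
std::vector<std::size_t> partial_shuffle_subset(std::size_t m, std::size_t n,
                                                Gen& gen) {
    assert(m <= n);
    const std::size_t k = std::min(m, n - m);  // size of the smaller side
    std::vector<std::size_t> indices(n);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    for (std::size_t i = 0; i < k; ++i) {
        // Unbiased pick from the not-yet-chosen prefix [0, n - i).
        std::uniform_int_distribution<std::size_t> dist(0, n - i - 1);
        std::swap(indices[dist(gen)], indices[n - i - 1]);
    }
    // The last k slots hold the picks. If k == m they are the subset;
    // otherwise they are the excluded elements and the prefix is the subset.
    const auto split = indices.end() - static_cast<std::ptrdiff_t>(k);
    if (k == m) {
        return std::vector<std::size_t>(split, indices.end());
    }
    return std::vector<std::size_t>(indices.begin(), split);
}
```

With a std::mt19937 gen, partial_shuffle_subset(8, 10, gen) performs only two draws (it picks the two excluded indices) and returns the remaining eight.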
I'm wondering whether the algorithm can be further improved in terms of performance. Improvements to the generic implementation are also welcome.