samedi 4 février 2017

C++ random generator with provided (at least estimated) entropy

Using the C++ standard random library I can generate sequences with pre-defined distributions more or less efficiently using the language-provided tools. What about Shannon entropy? Is there some way to define the output Shannon entropy for the generated sequence?
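For reference, the Shannon entropy of a discrete distribution with symbol probabilities p_i, measured in bits per symbol, is:

```latex
H = -\sum_{i} p_i \log_2 p_i
```

For byte-valued data there are at most 256 distinct symbols, and H is maximized by the uniform distribution, where H = log2(256) = 8 bits.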

I tried a small experiment: I generated a sufficiently long sequence with a uniform distribution and implemented a Shannon entropy calculator. The resulting value ranges from 0.0 (absolute order) to 8.0 (absolute chaos):

template <typename T>
double shannon_entropy(T first, T last)
{
    size_t frequencies_count{};
    double entropy = 0.0;

    std::for_each(first, last, [&entropy, &frequencies_count](auto item) {
        if (0.0 == item) return;  // log2(0) is undefined; zero-probability terms contribute nothing
        double fp_item = static_cast<double>(item);
        entropy += fp_item * std::log2(fp_item);
        ++frequencies_count;
    });

    if (frequencies_count > 256) {  // more distinct symbols than one byte can encode
        return -1.0;
    }

    return -entropy;
}

std::vector<uint8_t> generate_random_sequence(size_t sequence_size)
{
    std::vector<uint8_t> random_sequence;
    std::random_device rnd_device;

    std::cout << "Random device entropy: " << rnd_device.entropy() << '\n';

    std::mt19937 mersenne_engine(rnd_device());
    std::uniform_int_distribution<unsigned> dist(0, 255);

    auto gen = std::bind(dist, mersenne_engine);
    random_sequence.resize(sequence_size);
    std::generate(random_sequence.begin(), random_sequence.end(), gen);
    return random_sequence;  // return by value: NRVO applies; std::move here would inhibit it
}

std::vector<double> read_random_probabilities(size_t sequence_size)
{
    std::vector<size_t> bytes_distribution(256);
    std::vector<double> bytes_frequencies(256);

    std::vector<uint8_t> random_sequence = generate_random_sequence(sequence_size);

    size_t rnd_seq_size = random_sequence.size();
    std::for_each(random_sequence.begin(), random_sequence.end(), [&](uint8_t b) {
        ++bytes_distribution[b];
    });

    std::transform(bytes_distribution.begin(), bytes_distribution.end(), bytes_frequencies.begin(),
        [&rnd_seq_size](size_t item) {
        return static_cast<double>(item) / rnd_seq_size;
    });
    return bytes_frequencies;  // return by value: NRVO applies; std::move here would inhibit it
}

int main(int argc, char* argv[]) {

    size_t sequence_size = 1024 * 1024;
    std::vector<double> bytes_frequencies = read_random_probabilities(sequence_size);
    double entropy = shannon_entropy(bytes_frequencies.begin(), bytes_frequencies.end());

    std::cout << "Sequence entropy: " << std::setprecision(16) << entropy << std::endl;

    std::cout << "Min possible file size assuming max theoretical compression efficiency:\n";
    std::cout << (entropy * sequence_size) << " in bits\n";
    std::cout << ((entropy * sequence_size) / 8) << " in bytes\n";

    return EXIT_SUCCESS;
}

First, it appears that std::random_device::entropy() is hardcoded to return 32 in MSVC 2015 (i.e. 8.0 bits per byte in Shannon terms). As you can verify, that is not far from the truth: in this example the result is always close to 7.9998..., i.e. absolute chaos.

A working example is on IDEONE (by the way, their compiler hardcodes entropy() to return 0).

One more thing, the main question: is it possible to create a generator that produces a uniformly distributed sequence with a defined entropy, say between 6.0 and 7.0? Could this be implemented at all, and if so, are there existing implementations?



