Tuesday, 1 November 2016

Repetitive but poorly compressing data

I am working on a compression-related project, and I would like some test data that is both repetitive and compresses poorly with LZMA (particularly the Python implementation).

I am aware that many algorithms, LZMA especially, exploit repetition in the data in a broad sense and, in LZMA's case, rely on Markov-chain modelling in a narrow sense. My two criteria are therefore at odds. I am hoping there are some well-known corner cases that LZMA stumbles on.

Where can I find or generate a file that is about 10 KB uncompressed, is in some sense repetitive or self-similar (as judged subjectively by a human), and doesn't become trivially tiny when compressed?

For instance, a 10 KB string of zeros is very repetitive, but when I compress it with 7z I end up with only 150 bytes, which is too small for my testing purposes. I'd like the compressed file to be 1 KB or more.
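To make the size criterion concrete, here is a minimal sketch of how I would measure a candidate with Python's standard `lzma` module. The helper name `compressed_size` is made up for this illustration, and the numbers it reports are for the default `.xz` container, so they will not match the 7z archive size quoted above exactly. The second input is only a hypothetical candidate (a seeded pseudo-random 1 KB block repeated ten times), included to show the kind of comparison I have in mind; I am assuming it counts as "repetitive", which may not match a human's judgement.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after lzma.compress (default .xz format)."""
    return len(lzma.compress(data))


# Baseline from the question: 10 KB of zeros, which compresses to almost nothing.
zeros = bytes(10 * 1024)

# Hypothetical candidate (an assumption, not a known LZMA corner case):
# one pseudo-random 1 KB block, seeded for reproducibility, repeated ten times.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(1024))
candidate = block * 10

for name, data in (("zeros", zeros), ("candidate", candidate)):
    print(f"{name}: {len(data)} bytes -> {compressed_size(data)} bytes compressed")
```

The repeated block cannot shrink much below the 1 KB needed to describe its first copy, which is why it serves as a rough reference point for the "1 KB or more" target.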



