I am working on a compression-related project, and I would like some test data that is both repetitive and compresses poorly with LZMA (particularly the Python implementation).
I am aware that many compression algorithms, LZMA especially, exploit repetition in the data in a broad sense and rely on Markov-chain modeling in a narrow sense, so my two criteria are at odds. I am hoping there are some well-known corner cases that LZMA stumbles on.
Where can I find or generate a file that is about 10 KB uncompressed, is in some sense repetitive or self-similar (as judged subjectively by a human), and doesn't become trivially tiny when compressed?
For instance, a 10 KB string of zeros is very repetitive, but when I compress it with 7z I end up with only 150 bytes, which is too small for my testing purposes. I'd like the compressed file to be 1 KB or more.
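To make the size criterion concrete, here is a minimal sketch of how I would measure candidates with Python's standard `lzma` module. The helper names `lzma_size` and `noisy_records` are my own, and the "fixed motif plus two pseudo-random bytes per record" layout is only one hypothetical candidate, not a known LZMA corner case. Note that `lzma.compress` writes the `.xz` container by default, so the byte counts will not match what 7z reports exactly.

```python
import lzma
import random


def lzma_size(data: bytes) -> int:
    """Return the size in bytes of `data` after lzma.compress (default .xz format)."""
    return len(lzma.compress(data))


def noisy_records(n_records: int = 1000, seed: int = 0) -> bytes:
    """Build n_records 10-byte records: an 8-byte fixed motif plus 2 pseudo-random bytes.

    The motif makes the data look repetitive to a human, while the
    pseudo-random bytes carry entropy that a compressor cannot remove.
    """
    rng = random.Random(seed)  # seeded so the test data is reproducible
    motif = b"ABCDEFGH"
    out = bytearray()
    for _ in range(n_records):
        out += motif
        out += bytes(rng.randrange(256) for _ in range(2))
    return bytes(out)


zeros = bytes(10_000)    # the all-zero baseline from the question
noisy = noisy_records()  # 1000 records * 10 bytes = 10,000 bytes

print("zeros:", len(zeros), "->", lzma_size(zeros))
print("noisy:", len(noisy), "->", lzma_size(noisy))
```

The noisy candidate contains 2,000 bytes that are effectively random, so its compressed size should stay well above the 1 KB target, while the all-zero baseline collapses to a small fraction of that.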