the notes of ZIP workshop 1-4, compression and decompression with deflate and fixed Huffman
introduction
this entry follows http://d.hatena.ne.jp/takahirox/20110723/1311406642
this is the notes of ZIP workshop.
- the blog of the host is here. you can get handout.
- atnd
yeah, it's been a long time since i attended the workshop. :P
abstract
i implemented a decompression program and compression program with deflate and fixed Huffman. deflate is a compression algorithm used in zip.
i'll introduce them here.
source code
you can get the source codes of them, here.
- http://code.google.com/p/ta-binary-edit/source/browse/decompress.cpp
- http://code.google.com/p/ta-binary-edit/source/browse/compress.cpp
note that they aren't sophisticated yet. :P
decompression
i first tried implementing a decompression program.
because it's easier to implement than compression one. i think basically there in one way to achieve the decompression logic, while there're various ways to achieve the compression one. and the logic of the compression affects the compressibility and the processing speed.
experiment
first, i made a compressed data by deflate with python.
% cat compress.py import zlib import sys import os f = os.fdopen( 1, 'wb', 0 ) f.write( zlib.compress( sys.argv[1] )[2:-4] ) % python compress.py hogehoge > compressed.dat % od -tx1 compressed.dat 0000000 cb c8 4f 4f cd 00 62 00 0000010
and then, i decompressed it with the program that i implemented.
% g++ -o decompress decompress.cpp
% ./decompress compressed.dat
hogehoge
i made it!
but still i have a problem with the program. it can't work correctly in the case a compressed data includes multi-byte characters. i'm still not sure why it doesn't work...
compression
and next, i tried implementing a compressing program.
as i said, there're various logics to achieve it. so provisionally, i decided a simple (and dull) logic.
experiment
first, i compressed plane text.
% echo -n hogehoge > plane.dat % g++ -o compress compress.cpp % ./compress plane.dat > compressed.dat % od -tx1 compressed.dat 0000000 cb c8 4f 4f 05 61 00 00 0000010
it seemed to make it.
and then, i decompressed it with both my program and python to confirm it succeeded.
% ./decompress compressed.dat hogehoge % cat decompress.py import zlib import sys print zlib.decompress( file( sys.argv[1], 'rb' ).read( ), -8 ) % python decompress.py compressed.dat hogehoge
i made it!
but i have still problems with the program as well.
one is that it can't work correctly sometime. i just only found that it depends on handled data, but still not sure the details.
the other one is that compressibility is not quite there.
for example, i think the ideal result for compressing "hellohellohello" is "hello(10,5)", where (a,b) means that a is a length and b is a distance.
but the result of the program i made is "hello(5,5)(10,10)". so, there is a room to improve my logic.
conclusion
i achieved a decompressing program and a compressing program with deflate and fixed Huffman, though i still have some problems.
i still have a lot of things what i want to do. i'm going to do them in the following order.
- to solve the problems
- to use dynamic Huffman
- to improve compressibility and processing speed
- to embedded the decompressing program into the zip analysis tool that i made before
- to embedded the compressing program into the zip archive tool that i made before