the notes of ZIP workshop 1-4, compression and decompression with deflate and fixed Huffman

introduction

this entry follows http://d.hatena.ne.jp/takahirox/20110723/1311406642

this is the notes of ZIP workshop.

the blog of the host is here. you can get handout.
- http://d.hatena.ne.jp/n7shi/20110529
atnd
- http://atnd.org/events/14792

yeah, it's been a long time since i attended the workshop. :P

abstract

i implemented a decompression program and compression program with deflate and fixed Huffman. deflate is a compression algorithm used in zip.

i'll introduce them here.

source code

you can get the source codes of them, here.

note that they aren't sophisticated yet. :P

decompression

i first tried implementing a decompression program.

because it's easier to implement than compression one. i think basically there in one way to achieve the decompression logic, while there're various ways to achieve the compression one. and the logic of the compression affects the compressibility and the processing speed.

logic

the flow of decompressing logic of the program is here.

experiment

first, i made a compressed data by deflate with python.

% cat compress.py
import zlib
import sys
import os

f = os.fdopen( 1, 'wb', 0 )
f.write( zlib.compress( sys.argv[1] )[2:-4] )

% python compress.py hogehoge > compressed.dat
% od -tx1 compressed.dat
0000000 cb c8 4f 4f cd 00 62 00
0000010

and then, i decompressed it with the program that i implemented.

% g++ -o decompress decompress.cpp
% ./decompress compressed.dat
hogehoge

i made it!

but still i have a problem with the program. it can't work correctly in the case a compressed data includes multi-byte characters. i'm still not sure why it doesn't work...

compression

and next, i tried implementing a compressing program.

as i said, there're various logics to achieve it. so provisionally, i decided a simple (and dull) logic.

logic

the flow of compressing logic of the program is here.

experiment

first, i compressed plane text.

% echo -n hogehoge > plane.dat
% g++ -o compress compress.cpp
% ./compress plane.dat > compressed.dat
% od -tx1 compressed.dat
0000000 cb c8 4f 4f 05 61 00 00
0000010

it seemed to make it.

and then, i decompressed it with both my program and python to confirm it succeeded.

% ./decompress compressed.dat
hogehoge

% cat decompress.py
import zlib
import sys

print zlib.decompress( file( sys.argv[1], 'rb' ).read( ), -8 )

% python decompress.py compressed.dat
hogehoge

i made it!

but i have still problems with the program as well.

one is that it can't work correctly sometime. i just only found that it depends on handled data, but still not sure the details.

the other one is that compressibility is not quite there.

for example, i think the ideal result for compressing "hellohellohello" is "hello(10,5)", where (a,b) means that a is a length and b is a distance.

but the result of the program i made is "hello(5,5)(10,10)". so, there is a room to improve my logic.

evaluating by hand

and the worst problem is that processing speed is very slow.

that's because this logic checks almost all data by byte.

and i evaluated it in worst case by hand. under 32KB data(32KB is queue size) O(x) = x^2, after that it increases by O(x) = x.

how can i make it faster?

conclusion

i achieved a decompressing program and a compressing program with deflate and fixed Huffman, though i still have some problems.

i still have a lot of things what i want to do. i'm going to do them in the following order.

to solve the problems
to use dynamic Huffman
to improve compressibility and processing speed
to embedded the decompressing program into the zip analysis tool that i made before
to embedded the compressing program into the zip archive tool that i made before