469,286 Members | 2,442 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,286 developers. It's quick & easy.

Huffman Encoding / Decoding

72
The other thread got too long so decided to start a new one.
Got my huffman tree working and am now moving onto encoding and decoding my given text file.

My idea now is to print out all the leaf nodes into a string and split them up using tokens to get my ascii value and the huffman code and store into a collection for reference when encoding.

Just got some questions on this:

1. Is there anyway to look through the huffman tree for the ascii value since they are all now stuck in one node?

2. I was thinking of putting the codes into a map with the huffman code as the key and the ascii value as the map value since it will be sorted by the huffman code which is also determined by the frequency of the ascii value appearing. Thus less time is need to traverse the whole map. Would this be efficient or is there a better way?

Thanks :)
Aug 5 '07 #1
11 6455
JosAH
11,448 Expert 8TB
The other thread got too long so decided to start a new one.
Got my huffman tree working and am now moving onto encoding and decoding my given text file.

My idea now is to print out all the leaf nodes into a string and split them up using tokens to get my ascii value and the huffman code and store into a collection for reference when encoding.

Just got some questions on this:

1. Is there anyway to look through the huffman tree for the ascii value since they are all now stuck in one node?

2. I was thinking of putting the codes into a map with the huffman code as the key and the ascii value as the map value since it will be sorted by the huffman code which is also determined by the frequency of the ascii value appearing. Thus less time is need to traverse the whole map. Would this be efficient or is there a better way?

Thanks :)
I don't know why you're talking about ASCII values. The leaf nodes represent a
character (*any* Unicode character) which is associated with a Huffman code.

A Huffman code for a character simply is a sequence of bits. Instead of printing
the character to the output, you print that bit sequence. That's the essence of
Huffman encoding; it's all about data compression.

You need a mapping from those Unicode characters to the bit sequences for
the compression or encoding part.

When you decompress such a file you should read bits from a bit sequence and
check whether or not you still have that bit sequence associated with a Unicode
character.

For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.

kind regards,

Jos
Aug 5 '07 #2
KWSW
72
About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

Still not too sure about what you mean by
For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.
For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.

Was thinking of doing it the same way but I guess its gonna be very inefficient...
Aug 5 '07 #3
JosAH
11,448 Expert 8TB
About the ascii part, the assignment just requires the encoding of a text file thats why I use ascii value. Also the requirement is to print the huffman code as string instead of bits(not too sure why also).

Still not too sure about what you mean by

For a zero bit read, you take the route of a left child; go right when a 1 bit is
read. The moment you reach a leaf node you can emit the associated character.
For now I just created a huffmancode object with the ascii value and its huffman code representation and put them all into an array sorrting with the shortest code first. Thus for compression I just search the ascii value and get the huffman code for it.
Was thinking of doing it the same way but I guess its gonna be very inefficient...
Suppose you have the following little Huffman tree:
Expand|Select|Wrap|Line Numbers
  1.       .
  2.      / \
  3.     /\  C
  4.    A  B
  5.  
So the encoding is:

A: 00
B: 01
C: 1

Suppose you're reading the folling bits 010011, start at the root of the tree:

0: go left; 1: go right; you're at a leaf, a B was read. Start at the root again;
0: go left; 0: go right; you're at a leaf, an A was read. Start at the root again;
1: go right; you're at a leaf, a C was read. Start at the root again;
1: go right; you're at a lead, a C was read. Start at the root again;
the end of the bit stream was read so you have read BACC

You have to keep that tree somewhere if you want to decode efficiently.
The encoding part can be done with a Map<Character, String> in your case.

kind regards,

Jos
Aug 5 '07 #4
KWSW
72
Hi josAH,

I happen to find ur set of huffman and would like to know it means by your add bit function.

Expand|Select|Wrap|Line Numbers
  1. protected void add(int bit) 
  2.         code&= (1<<len)-1;
  3.         code<<= 1; 
  4.         code|=  bit;
  5.         len++;
  6. }
I understand len++ is to add one to the code length but do not understand the rest. It seems my method of just adding "0" to the left and "1" to the right doesn't give me prefix codes....

edit:

ok i found out these out

&= "bitwise AND and assign"
<<= "shift bits left and assign"
|= "bitwise OR and assign"
<< "shift bits left"
Aug 5 '07 #5
JosAH
11,448 Expert 8TB
Hi josAH,

I happen to find ur set of huffman and would like to know it means by your add bit function.

Expand|Select|Wrap|Line Numbers
  1. protected void add(int bit) 
  2.         code&= (1<<len)-1;
  3.         code<<= 1; 
  4.         code|=  bit;
  5.         len++;
  6. }
Where did you find that code? Not to worry though, I don't claim any copyright.

kind regards,

Jos
Aug 5 '07 #6
KWSW
72
Where did you find that code? Not to worry though, I don't claim any copyright.

kind regards,

Jos
Found it over at the java forums and the user name was the same as urs...

anyway I decided to use the method that prints the bit into string for me to compare and this is what i got


Expand|Select|Wrap|Line Numbers
  1. c 5 00000 0
  2. b 6 010000 10000
  3. w 6 110000 110000
  4. r 4 1000 1000
  5. e 3 100 100
  6. u 5 00010 10
  7. l 5 10010 10010
  8. i 4 1010 1010
  9. s 4 0110 110
  10. S 9 000001110 1110
  11. T 10 0100001110 100001110
  12. I 10 1100001110 1100001110
  13.  
  14.  8 10001110 10001110
  15. k 9 001001110 1001110
  16. j 9 101001110 101001110
  17. G 10 0011001110 11001110
  18. W 10 1011001110 1011001110
  19. : 10 0111001110 111001110
  20. F 10 1111001110 1111001110
  21. , 6 101110 101110
  22. d 5 11110 11110
  23. a 4 0001 1
  24. n 4 1001 1001
  25. o 4 0101 101
  26. g 6 001101 1101
  27. p 6 101101 101101
  28. m 6 011101 11101
  29. M 11 00000111101 111101
  30. z 11 10000111101 10000111101
  31. V 12 001000111101 1000111101
  32. O 12 101000111101 101000111101
  33. ( 13 0011000111101 11000111101
  34. ) 13 1011000111101 1011000111101
  35. ' 13 0111000111101 111000111101
  36. - 13 1111000111101 1111000111101
  37. N 10 0100111101 100111101
  38. ; 10 1100111101 1100111101
  39. x 10 0010111101 10111101
  40. K 13 0001010111101 1010111101
  41. [ 13 1001010111101 1001010111101
  42. ] 13 0101010111101 101010111101
  43. E 13 1101010111101 1101010111101
  44. D 11 11010111101 11010111101
  45. H 9 110111101 110111101
  46. v 7 1111101 1111101
  47. t 4 0011 11
  48. . 8 00001011 1011
  49. A 11 00010001011 10001011
  50. P 11 10010001011 10010001011
  51. q 11 01010001011 1010001011
  52. B 11 11010001011 11010001011
  53. C 10 0110001011 110001011
  54. U 12 001110001011 1110001011
  55. R 12 101110001011 101110001011
  56. J 12 011110001011 11110001011
  57. L 13 0111110001011 111110001011
  58. Y 13 1111110001011 1111110001011
  59. y 7 1001011 1001011
  60. f 6 101011 101011
  61. h 5 11011 11011
  62.   3 111 111
where the first is the character, the 2nd is my huffmancode in string and the 3rd part is the bits after using integer.toBinaryString(int).

I could just modify my code by storing the huffman code in bits and using the integer.toBinaryString(int) to write to file for my assignment requirements but than I will not be learning anything...
Aug 5 '07 #7
JosAH
11,448 Expert 8TB
I could just modify my code by storing the huffman code in bits and using the integer.toBinaryString(int) to write to file for my assignment requirements but than I will not be learning anything...
It seems to look alright to me: the most frequently used letters are associated
with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
Huffman codes are 0, you can't tell the length of the code when you'd store it
as a simple int. You have to keep track of the bitlength of the code as I did in
my version.

Reading and writing bits from/to a stream is an exercise by itself.

kind regards,

Jos
Aug 5 '07 #8
KWSW
72
It seems to look alright to me: the most frequently used letters are associated
with the shortest Huffman codes. As you can see, if the leftmost bit(s) of the
Huffman codes are 0, you can't tell the length of the code when you'd store it
as a simple int. You have to keep track of the bitlength of the code as I did in
my version.

Reading and writing bits from/to a stream is an exercise by itself.

kind regards,

Jos

So this part has something to do with writing bits to the stream?

Expand|Select|Wrap|Line Numbers
  1. protected void add(int bit) 
  2. {                                                                 
  3.    code &= (1<<len)-1;
  4.    code <<= 1;
  5.    code |=  bit;
  6.    len++;
  7. }
Aug 5 '07 #9
JosAH
11,448 Expert 8TB
Just still clueless on this part:

Expand|Select|Wrap|Line Numbers
  1. protected void add(int bit) 
  2. {                                                                 
  3.    code &= (1<<len)-1;
  4.    code <<= 1;
  5.    code |=  bit;
  6.    len++;
  7. }
It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
is the length of the code so far. The parameter 'bit' is either one or zero. This
little method shifts in that bit on the right side of the code and the length is
incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
sure that only 'len' bits in the code are used before the new bit is shifted in.
Check your Java manuals for those bit manipulating operators.

I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
at that time ;-)

kind regards,

Jos
Aug 5 '07 #10
KWSW
72
It's just bit fiddling (I'm good at that ;-) 'code' is the bit sequence so far and 'len'
is the length of the code so far. The parameter 'bit' is either one or zero. This
little method shifts in that bit on the right side of the code and the length is
incremented. The first line 'code&= (1<<len)-1' isn't actually needed but it makes
sure that only 'len' bits in the code are used before the new bit is shifted in.
Check your Java manuals for those bit manipulating operators.

I could've written 'code= code*2+bit' but I think my mind was in a bit fiddling mood
at that time ;-)

kind regards,

Jos
thanks for explaining it more... makes sense now... managed to find the website from java.sun.com for bitwise operation. Will go through it once I get this assignment done... and I still have not started on hamming yet...

But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

the logic i was thinking of:

if its "0" go to the left node else go to the right node.
At new node check if its the end of the branch.
If it is then get the symbol.
Else get the next bit and check for left or right node.
Aug 5 '07 #11
JosAH
11,448 Expert 8TB
But still kinda stuck with the decoding... understood what you mean by "0" to the left and "1" to the right, but how would I do that since the end result of the treeset is the root node if I understand it correctly...

the logic i was thinking of:

if its "0" go to the left node else go to the right node.
At new node check if its the end of the branch.
If it is then get the symbol.
Else get the next bit and check for left or right node.
Yep, that's exactly it. After you are ready with the construction phase, you have
that Huffman tree and the decoder must use that tree in the way your described
above.

Huffman compression is an 'off line' compression technique, i.e. first you have
to read the entire tex and build the tree before you can perform any compression
on the text.

Another disadvantage is that not only the compressor needs that tree, the de-
compressor needs it as well. How is the decompressor supposed to find that
tree? The basic answer is that the compressor must include the tree in the
compressed data; as the first data to be read actually because otherwise the
decompressor can't decompress anything.

If you're interested in 'on line' compression techniques read all about the LZW
algorithm.

kind regards,

Jos
Aug 5 '07 #12

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

3 posts views Thread by dot | last post: by
9 posts views Thread by Mark | last post: by
15 posts views Thread by aarklon | last post: by
37 posts views Thread by Zhiv Kurilka | last post: by
reply views Thread by Michele | last post: by
reply views Thread by coucchy | last post: by
reply views Thread by zhoujie | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.