Re: compressing short strings?

Helmut Jarausch:
I'd ask in comp.compression, where the specialists are listening and are
very helpful.

Asking in comp.compression is a good starting point.

My suggestions (sorry if they look a bit unsorted): it depends on what
language you want to use, how much you want to compress the strings,
whether they are ASCII or Unicode, how quickly you need to compress/
decompress them, and how much time you have to write the encoder/
decoder.

You can use Python or a C/D extension. If you use C you will need more
time to develop it, but you will be much freer to use any algorithm you
want (well, you are free with Python too, but the code may be slower if
it performs low-level things), so D may be an acceptable compromise.

Generally it's better to start with the simplest solution and go on
from there, looking for more complex ones only if the first isn't
enough. In Python the simplest thing to try is to compress the strings
with the "zip" (zlib) codec:

"George Washington".encode("zip")

The result is probably longer than the original (25 bytes of output for
this string). With "bz2" the results are worse (56 bytes for this
string).
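For reference, the same measurement using the zlib and bz2 modules
directly (this also works on Python 3, where str.encode("zip") is not
available); the exact byte counts may differ slightly with the library
version:

import zlib, bz2

s = b"George Washington"          # 17 bytes
print(len(zlib.compress(s)))      # about 25 bytes: longer than the input
print(len(bz2.compress(s)))       # about 56 bytes: bz2 has a larger fixed overhead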

A possible alternative is to compress many (10-30) strings at a time,
using an external table of their starting points (you can put the
table at the beginning), or you can separate the strings with a
separator char that's never present in them; then you zip that whole
group. If you have 1 million strings you can divide them into such
little groups, but the indexing in the DB becomes complex, and I
don't like this solution much.
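A minimal sketch of that separator-based grouping (the NUL separator
and the function names are illustrative choices, not a fixed design):

import zlib

SEP = "\x00"   # assumed never to appear inside the strings

def compress_group(strings):
    # Compress a small group (10-30) of strings as a single zlib block.
    return zlib.compress(SEP.join(strings).encode("utf-8"))

def get_string(blob, index):
    # Decompress the whole group and pick out one string by position.
    return zlib.decompress(blob).decode("utf-8").split(SEP)[index]

names = ["George Washington", "John Adams", "Thomas Jefferson"]
blob = compress_group(names)
assert get_string(blob, 1) == "John Adams"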

A slightly less simple Python solution is to keep a fixed corpus of
text and compress it both with and without the little string you want
to add (like you say), compressing/decompressing everything each time
(if the corpus is small enough this isn't too slow), for example:
>>> s2 = ...  # a 2488 bytes long text
>>> len(s2)
2488
>>> z1 = s2.encode("zip")
>>> len(z1)
1321
>>> s = 'George Washington'
>>> len(s.encode("zip"))
25
>>> len(s.encode("bz2"))
56
>>> z2 = (s2 + s).encode("zip")
>>> len(z2)
1332
>>> len(z2) - len(z1)
11

So you need to store only this 11-byte-long string to be able to
decompress it. A bigger corpus (s2) gives you a bit more compression,
but it requires more time for compression/decompression:
>>> len(s3)  # s3 is a longer English text
10811
>>> z3 = s3.encode("zip")
>>> len(z3)
4971
>>> z4 = (s3 + s).encode("zip")
>>> len(z4)
4980
>>> len(z4) - len(z3)
9
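A more robust variant of the same shared-corpus idea, in modern Python
(3.3+), is zlib's preset-dictionary support, which avoids relying on
the two compressed streams sharing a common prefix. A minimal sketch,
where CORPUS is a placeholder for the shared text that compressor and
decompressor must keep identical:

import zlib

CORPUS = b"..."  # the shared English corpus, identical on both sides

def compress_short(data, corpus=CORPUS):
    c = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_DEFAULT_STRATEGY, corpus)
    return c.compress(data) + c.flush()

def decompress_short(blob, corpus=CORPUS):
    d = zlib.decompressobj(15, corpus)
    return d.decompress(blob) + d.flush()

print(len(compress_short(b"George Washington")))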

One of the simpler ways to compress short ASCII English texts is to
notice that the most frequent chars are very few, so you can encode
the 15 most common English chars with 1 nibble each, and all the other
ones with 3 nibbles (the first nibble is an escape, and the two
following ones are the whole byte). If you use a low-level language
you can implement this in a few lines of code; it's very fast and it
may shave off 25-33% of your bytes. It means that your "George
Washington" (17 chars long) may become about 12-13 bytes long. I have
implemented this with Python too, it's easy.
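A sketch of that nibble scheme in Python 3 (the 15-character table
below is only an illustrative guess at the most frequent chars, not a
tuned one):

COMMON = " etaoinshrdlcum"                 # 15 frequent chars; nibble value 15 is the escape
ENC = {c: i for i, c in enumerate(COMMON)}

def nibble_encode(text):
    nibbles = []
    for ch in text:
        if ch in ENC:
            nibbles.append(ENC[ch])                 # common char: 1 nibble
        else:
            b = ord(ch)                             # assumes ASCII input
            nibbles.extend([15, b >> 4, b & 0xF])   # escape + the whole byte in 2 nibbles
    if len(nibbles) % 2:
        nibbles.append(15)                          # pad to a whole byte; a lone trailing escape is ignored
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))

def nibble_decode(data):
    nibbles = []
    for b in data:
        nibbles.extend([b >> 4, b & 0xF])
    out, i = [], 0
    while i < len(nibbles):
        n = nibbles[i]
        if n != 15:                                 # common char
            out.append(COMMON[n])
            i += 1
        elif i + 3 <= len(nibbles):                 # escaped raw byte
            out.append(chr((nibbles[i + 1] << 4) | nibbles[i + 2]))
            i += 3
        else:                                       # trailing padding
            break
    return "".join(out)

s = "George Washington"
packed = nibble_encode(s)
assert nibble_decode(packed) == s
print(len(s), "->", len(packed))                    # 17 -> 13 with this table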

Another possible solution is to use Huffman compression with fixed
symbol frequencies. In Python you can write such a compressor/
decompressor in just 10-15 lines of code, and you can use the heapq
module to speed up the processing. This may lead to more compression,
but I presume not much below 4-5 bits per char (bpc). And if you have
to compress/decompress a LOT of strings you may need to write it (or
look for existing code) in C/D.

This is one implementation that doesn't use heapq yet:

from operator import itemgetter

def clean_tree(obj):
    # Strip the frequency counts, leaving a nested [left, right] tree of symbols.
    if isinstance(obj, tuple):
        return obj[0]
    else:
        return [clean_tree(obj[0][0]), clean_tree(obj[0][1])]

def huffman_generate(freqsl):
    # Repeatedly merge the two least frequent nodes until only two remain.
    while len(freqsl) > 2:
        freqsl.sort(key=itemgetter(1))
        freqsl = [[freqsl[0:2], freqsl[0][1] + freqsl[1][1]]] + freqsl[2:]
    return [freqsl, -1]

def tree2dict(tree):
    # Walk the tree, appending "0" on left branches and "1" on right branches.
    def encode(tree, out=""):
        if isinstance(tree, list):
            encode(tree[0], out + "0")
            encode(tree[1], out + "1")
        else:
            dic[tree[0]] = out
        return dic
    dic = {}
    return encode(tree)

freqs = [('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]
print(tree2dict(clean_tree(huffman_generate(freqs))))
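With the textbook frequencies above this prints one binary code per
symbol, with shorter codes for the more frequent symbols (e.g. 'a' gets
a 1-bit code, 'e' and 'f' get 4-bit codes). A possible heapq-based
variant of the same idea, as a sketch (the function name huffman_codes
is mine):

import heapq
from itertools import count

def huffman_codes(freqs):
    # Build the Huffman tree with a heap instead of repeated sorting.
    tiebreak = count()     # gives the heap a total order without comparing subtrees
    heap = [(f, next(tiebreak), sym) for sym, f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), [left, right]))
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, list):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(heap[0][2])
    return codes

print(huffman_codes([('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]))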

If you use Huffman coding you can apply the usual tricks to increase
compression: use 2-char symbols, or even whole words as symbols (but
whole words work well only if you are sure your language is English
with few unusual words).
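For example, a tiny helper to gather 2-char symbol frequencies from a
training corpus (a sketch; the sample corpus and the non-overlapping
split are illustrative choices):

from collections import Counter

def digram_freqs(corpus):
    # Count non-overlapping 2-char symbols for a digram-based Huffman table.
    pairs = [corpus[i:i + 2] for i in range(0, len(corpus) - 1, 2)]
    return Counter(pairs).most_common()

print(digram_freqs("the cat sat on the mat")[:5])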

If you want more compression, like a 6-byte (2.8 bpc) output, you will
need much more complex algorithms. On long English texts the best
compressors reach less than 0.9 bpc, but they are surely useless for
you. I don't think you will be able to go down to 6 bytes for that
string, unless you are ready to pay a lot of computational time :-)

Bye,
bearophile
Jun 27 '08 #1
bearophile:
So you need to store only this 11 byte long string to be able to
decompress it.
Note that there may be a header that contains changing fields, like
the length of the compressed text, etc., so the two compressed streams
may not share an exact common prefix.
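A quick way to check this empirically (a hypothetical helper, not part
of the original posts):

def common_prefix_len(a, b):
    # Length of the shared prefix of two byte strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# z1 = s2.encode("zip"); z2 = (s2 + s).encode("zip")
# print(common_prefix_len(z1, z2), len(z1), len(z2))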

Bye,
bearophile
Jun 27 '08 #2
On May 20, 8:24 am, bearophileH...@lycos.com wrote:
bearophile:
So you need to store only this 11 byte long string to be able to
decompress it.

Note that there may be a header that contains changing fields, like
the length of the compressed text, etc.

Bye,
bearophile
I've read that military texts contain different letter frequencies
than standard English. If you use a non-QWERTY keyset, it may change
your frequency distribution as well.

I worry that this is an impolite question; as such, I lean on peers
for backup:

Will you be adding further entries to the zipped list later on?

Will you be rewriting the entire file upon update, or just appending
bytes?

If your sequence is 'ab, ab, ab, cd, ab', you might be at:

00010.

Add 'cd' again and you're at:

000101.

You didn't have to re-output the contents.

But if you add 'bc', you have:

0001012, which isn't binary any more. So you're left re-encoding
everything at a wider width:

000 000 000 001 000 001 010

and a handful more new distinct symbols will overflow that width too.
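A sketch of that effect with fixed-width symbol codes (the function is
hypothetical, just to reproduce the '00010' example above; it numbers
symbols in order of first appearance and uses the minimum width, so
the second example uses 2-bit rather than 3-bit groups):

def encode_fixed_width(seq):
    # Assign each distinct string an index in order of first appearance,
    # then emit fixed-width binary codes.
    index = {}
    for s in seq:
        index.setdefault(s, len(index))
    width = max(1, (len(index) - 1).bit_length())   # bits per symbol
    bits = "".join(format(index[s], "0{}b".format(width)) for s in seq)
    return width, bits

print(encode_fixed_width(["ab", "ab", "ab", "cd", "ab"]))
# (1, '00010')
print(encode_fixed_width(["ab", "ab", "ab", "cd", "ab", "cd", "bc"]))
# (2, '00000001000110') -- a new symbol forces a wider code and a full re-encode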

I'd say pickle the corpus, with new additions, and re-zip the entire
contents each time. Would you like to break it across multiple files
(cough, disk sectors, cough), say a different corpus/compressed-data
file pair for every thousand entries?
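A minimal sketch of that re-pickle-and-re-zip approach (the file
handling and function names are mine):

import pickle, zlib

def save_corpus(strings, path):
    # Re-pickle and re-zip the whole list on every update, as suggested above.
    with open(path, "wb") as f:
        f.write(zlib.compress(pickle.dumps(strings)))

def load_corpus(path):
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))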
Jun 27 '08 #3
