473,387 Members | 1,687 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

simple string compression?

I'm looking for some kind of simple string compression code I can use. I'm
not looking for SharpZipLib. Their implimentation spans several classes and
is very complex. I'm just looking for something simple. A single class,
preferrably. Does such a thing exist? Thanks,

Chris
Nov 15 '05 #1
20 8764
Hi Chris,

What kind of strings are you talking about here? What's in them and how
long are they? What strength of compression do you want? (yes, ideally down to
1 byte, lol, but what would be reasonable)

Regards,
Fergus

|| I'm looking for some kind of simple string compression code I can
|| use. I'm not looking for SharpZipLib. Their implimentation spans
|| several classes and is very complex. I'm just looking for something
|| simple. A single class, preferrably. Does such a thing exist?
|| Thanks,
|| Chris

Nov 15 '05 #2
Hi Chris,

What kind of strings are you talking about here? What's in them and how
long are they? What strength of compression do you want? (yes, ideally down to
1 byte, lol, but what would be reasonable)

Regards,
Fergus

|| I'm looking for some kind of simple string compression code I can
|| use. I'm not looking for SharpZipLib. Their implimentation spans
|| several classes and is very complex. I'm just looking for something
|| simple. A single class, preferrably. Does such a thing exist?
|| Thanks,
|| Chris

Nov 15 '05 #3
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.

Chris

"Fergus Cooney" <fi******@tesco.net> wrote in message
news:uo****************@TK2MSFTNGP09.phx.gbl...
Hi Chris,

What kind of strings are you talking about here? What's in them and how long are they? What strength of compression do you want? (yes, ideally down to 1 byte, lol, but what would be reasonable)

Regards,
Fergus

|| I'm looking for some kind of simple string compression code I can
|| use. I'm not looking for SharpZipLib. Their implimentation spans
|| several classes and is very complex. I'm just looking for something || simple. A single class, preferrably. Does such a thing exist?
|| Thanks,
|| Chris

Nov 15 '05 #4
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.

Chris

"Fergus Cooney" <fi******@tesco.net> wrote in message
news:uo****************@TK2MSFTNGP09.phx.gbl...
Hi Chris,

What kind of strings are you talking about here? What's in them and how long are they? What strength of compression do you want? (yes, ideally down to 1 byte, lol, but what would be reasonable)

Regards,
Fergus

|| I'm looking for some kind of simple string compression code I can
|| use. I'm not looking for SharpZipLib. Their implimentation spans
|| several classes and is very complex. I'm just looking for something || simple. A single class, preferrably. Does such a thing exist?
|| Thanks,
|| Chris

Nov 15 '05 #5
Chris LaJoie <ch***@etriptrader.com> wrote:
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.


Could you say exactly why it matters that SharpZipLib is several
classes? The reason that's important might constrain other solutions.
Do you have a sample file that you could put on a website somewhere for
us to test potential solutions?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #6
Chris LaJoie <ch***@etriptrader.com> wrote:
the files being compressed will be about 300 to 600k. The files are plain
text, and consist mostly of spaces, so they compress extremely well. I just
tried one, normal zip compression brought the file from 599,111 to 54,337
bytes. That's 9% of the original file size. I'm looking for something
(anything) that compresses the text by at least 50%.


Could you say exactly why it matters that SharpZipLib is several
classes? The reason that's important might constrain other solutions.
Do you have a sample file that you could put on a website somewhere for
us to test potential solutions?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #7
Hi Chris,

If over 50% of the file is spaces, run-length-encoding is an extremely
simple method (ie. you could do it in a single method, let alone a single
class!).

The files also sound as if they would be amenable to Huffman coding. This
is more complicated but still pretty straightforward.

Would you like further help with this?

Regards,
Fergus
Nov 15 '05 #8
Hi Jon,

|| Do you have a sample file that you could put on a
|| website somewhere for us to test potential solutions?

Yes, a few of your files in a zip for us to play with! :-)

Regards,
Fergus
Nov 15 '05 #9
Here is the file I'm using to test:
http://www.etriptrader.com/ual/remot...sep/ORDATD.300

Also, I found exactly what I was looking for here:
http://www.chipstips.info/showtopic.asp?topic=cshacc

the compressed text was 42% of the original size. Not good, but it will do.

Chris

"Fergus Cooney" <fi******@tesco.net> wrote in message
news:uK**************@TK2MSFTNGP09.phx.gbl...
Hi Jon,

|| Do you have a sample file that you could put on a
|| website somewhere for us to test potential solutions?

Yes, a few of your files in a zip for us to play with! :-)

Regards,
Fergus

Nov 15 '05 #10
Hi Chris,

http://www.etriptrader.com/ual/remot...sep/ORDATD.300

The link seems to download a blank file. :-(

Regards,
Fergus
Nov 15 '05 #11
Fergus Cooney <fi******@tesco.net> wrote:
If over 50% of the file is spaces, run-length-encoding is an extremely
simple method (ie. you could do it in a single method, let alone a single
class!).

The files also sound as if they would be amenable to Huffman coding. This
is more complicated but still pretty straightforward.

Would you like further help with this?


I've been thinking about this a bit, and I reckon it would be quite
interesting as it's *text only* to work it out as a
System.Text.Encoding class - that way it would be really simple to use,
for one thing.

Of course, if we knew the type of text which would need to be encoded,
that would help - if it were all ASCII, for instance, that increases
the options somewhat. (As a *very* simple example, you could store any
non-single spaces as a byte with the top bit set and the rest of the
byte being the number of spaces (2-129 encoded as 0-127) and that would
do quite nicely.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #12
Chris LaJoie <ch***@etriptrader.com> wrote:
Here is the file I'm using to test:
http://www.etriptrader.com/ual/remot...sep/ORDATD.300
Like Fergus, I couldn't get any useful data out of that.
Also, I found exactly what I was looking for here:
http://www.chipstips.info/showtopic.asp?topic=cshacc

the compressed text was 42% of the original size. Not good, but it will do.


Interesting that it returns it as a string - I'd anticipated that you'd
want it converting to a byte array. (One thing that's nasty about the
above is that it seems to believe that ASCII is 8 bit rather than 7 bit
- I don't believe it *is* actually compatible in environments that only
deal with ASCII.)

If you're still interested in seeing whether there are ways of getting
it down lower than that, feel free to mail me a couple of sample files
directly - if you only need the compressor to be able to compress real
ASCII, I'll try the scheme I proposed in my last post, just to see how
it does.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #13
> Like Fergus, I couldn't get any useful data out of that.

My appologies, the link now works.
If you're still interested in seeing whether there are ways of getting
it down lower than that, feel free to mail me a couple of sample files
directly - if you only need the compressor to be able to compress real
ASCII, I'll try the scheme I proposed in my last post, just to see how
it does.


Sure, if you think you can, I'd certainly be interrested in seeing what you
come up with. Even if it doesn't produce better compression ratios.
Thanks,

Chris
Nov 15 '05 #14
Chris LaJoie <ch***@etriptrader.com> wrote:
Like Fergus, I couldn't get any useful data out of that.


My appologies, the link now works.
If you're still interested in seeing whether there are ways of getting
it down lower than that, feel free to mail me a couple of sample files
directly - if you only need the compressor to be able to compress real
ASCII, I'll try the scheme I proposed in my last post, just to see how
it does.


Sure, if you think you can, I'd certainly be interrested in seeing what you
come up with. Even if it doesn't produce better compression ratios.


Well, using my "space saver" technique got the 519,942 byte input file
down to 217,875 bytes - again, 42%. I suspect that getting it down
significantly from there would take a fair amount more work -
basically, writing a Huffman encoder, which is reasonably
straightforward (I've done it before) but not trivial, and not
something I fancy doing at half four in the morning :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #15
> Well, using my "space saver" technique got the 519,942 byte input file
down to 217,875 bytes - again, 42%. I suspect that getting it down
significantly from there would take a fair amount more work -
basically, writing a Huffman encoder, which is reasonably
straightforward (I've done it before) but not trivial, and not
something I fancy doing at half four in the morning :)


Sounds good Jon. I'm anxious to see what you can do with that. If you
don't mind, I plan on submitting this code (with proper credit of course) to
various sites so compression code isn't so hard to come by.

Chris
Nov 15 '05 #16
Chris LaJoie <ch***@etriptrader.com> wrote:
Well, using my "space saver" technique got the 519,942 byte input file
down to 217,875 bytes - again, 42%. I suspect that getting it down
significantly from there would take a fair amount more work -
basically, writing a Huffman encoder, which is reasonably
straightforward (I've done it before) but not trivial, and not
something I fancy doing at half four in the morning :)


Sounds good Jon. I'm anxious to see what you can do with that. If you
don't mind, I plan on submitting this code (with proper credit of course) to
various sites so compression code isn't so hard to come by.


Sure. This is really only "early morning for fun" code though - no
comments, no error checking, hideously inefficient etc - I just wanted
to see how it would do.

Also, due to the fact that it *can* compress so well, the
GetMaxCharCount may well mean that some other classes will seriously
overallocate space.

So, with all the caveats over with, here's the code.

using System;
using System.Text;
using System.IO;

public class SpaceSaver : Encoding
{
public override int GetByteCount (char[] text, int start,
int length)
{
return ToBinary(text, start, length).Length;
}

public override int GetBytes (char[] text, int textStart,
int length, byte[] buffer,
int bufferStart)
{
byte[] binary = ToBinary(text, textStart, length);
Array.Copy (binary, 0, buffer, bufferStart, binary.Length);
return binary.Length;
}

public override int GetCharCount (byte[] binary, int start,
int length)
{
return ToString(binary, start, length).Length;
}

public override int GetChars (byte[] binary, int binaryStart,
int length, char[] buffer,
int bufferStart)
{
char[] chars = ToString (binary, binaryStart,
length).ToCharArray();
Array.Copy (chars, 0, buffer, bufferStart, chars.Length);
return chars.Length;
}

public override int GetMaxByteCount(int length)
{
return length;
}

public override int GetMaxCharCount(int length)
{
return length*129;
}

String ToString (byte[] data, int start, int length)
{
StringBuilder builder = new StringBuilder();
foreach (byte b in data)
{
if (b < 128)
builder.Append ((char)b);
else
builder.Append (new string (' ', b-126));
}
return builder.ToString();
}

byte[] ToBinary (char[] text, int start, int length)
{
using (MemoryStream ms = new MemoryStream())
{
int spaces=0;
foreach (char t in text)
{
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
}
else
{
switch (spaces)
{
case 0:
break;
case 1:
ms.WriteByte(32);
break;
default:
ms.WriteByte((byte)(spaces+126));
break;
}
spaces=0;
ms.WriteByte ((byte)t);
}
}
switch (spaces)
{
case 0:
break;
case 1:
ms.WriteByte(32);
break;
default:
ms.WriteByte((byte)(spaces+126));
break;
}
ms.Flush();
return ms.ToArray();
}
}

static void Main(string[] args)
{
string inputFile = args[0];

Encoding encoding = new SpaceSaver();

try
{
// Create the reader and writer with appropriate encodings.
using (StreamReader inputReader =
new StreamReader (inputFile, Encoding.ASCII))
{
string allText = inputReader.ReadToEnd();

byte[] data = encoding.GetBytes(allText);

Console.WriteLine ("data length="+data.Length);

string recovered = encoding.GetString(data);
if (recovered != allText)
{
Console.WriteLine ("Oops!");
Console.WriteLine (recovered.Substring (0, 100));
}
}
}
catch (IOException e)
{
Console.WriteLine ("Exception during processing: {0}",
e.Message);
}
}
}
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #17
> Also, due to the fact that it *can* compress so well, the
GetMaxCharCount may well mean that some other classes will seriously
overallocate space.


Thanks a lot for the code. A quick test proved it works, and gets a
compression ratio of about 41% of the original file size. Good job :)
I'll go into the code more tommorow, but I may have to mail you if I can't
figure out what's going on.

Chris
Nov 15 '05 #18
Hi Jon,

Very simple, very effective - nice routine. :-)

I don't know if you looked at the reference that Chris gave - the HACC
op-code method. It's pleasing that yours achieves a comparable result without
the complexity (not that it's that much more). It does, of course, depend on
what the data is.

I had a play with it and, noticing repeated sequences of '---', turned it
from a space-saver to a true RLE encoder. There weren't enough dashes to
overcome the loss of using the high-bit, however, and it added an extra 10K. I
think that some more could be squeezed out by allowing it to be a two-char
saver and using 0x40 as the switch. Reduced run lengths, but it would allow
for a second favourite character, ie the '-'. Not worth the effort though.

Like you say, Jon, the next (decent) level of compression is going to come
with a lot more complexity.
GetMaxCharCount(). Oh boy. Lol. 'Nuff said !! :-))
So too GetBytes and GetChars - oh, a copying we will go....
But, no worries, I know this was just an experiment of implementing an
Encoding.
Chris, when you submit Jon's code, can I recommend that you make some
small changes. A key point in this method is the use of the top bit, so I
think this should be emphasised.

Instead of
builder.Append (new string (' ', b-126));
have
builder.Append (new string (' ', (b+2)-128));

Instead of
ms.WriteByte((byte)(spaces+126));
have
ms.WriteByte((byte)(128 + (spaces-2)));

And to show that this is in the same scheme:
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
Have
if (t==' ')
{
if (spaces==129)
{
ms.WriteByte(128 + (spaces -2));
spaces=0;
spaces++;
}

Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
even go as far as having a constant for the compressed-spaces indicator, ie
the top bit.

Regards,
Fergus

ps. And now I know how an Encoding looks from the inside. :-)
Nov 15 '05 #19
Fergus Cooney <fi******@tesco.net> wrote:
Very simple, very effective - nice routine. :-)
Cheers :)
I don't know if you looked at the reference that Chris gave - the HACC
op-code method. It's pleasing that yours achieves a comparable result without
the complexity (not that it's that much more). It does, of course, depend on
what the data is.
I looked at it briefly, but the description wasn't terribly well
written unfortunately...
I had a play with it and, noticing repeated sequences of '---', turned it
from a space-saver to a true RLE encoder. There weren't enough dashes to
overcome the loss of using the high-bit, however, and it added an extra 10K. I
think that some more could be squeezed out by allowing it to be a two-char
saver and using 0x40 as the switch. Reduced run lengths, but it would allow
for a second favourite character, ie the '-'. Not worth the effort though.
Right. Possibly not - depending on just how many of them there were.
GetMaxCharCount(). Oh boy. Lol. 'Nuff said !! :-))
Indeed. It's quite possible that by overriding more of the Encoding
methods, that wouldn't be a problem - I suspect it's used by the
default implementations though.
So too GetBytes and GetChars - oh, a copying we will go....
But, no worries, I know this was just an experiment of implementing an
Encoding.
:)
Chris, when you submit Jon's code, can I recommend that you make some
small changes. A key point in this method is the use of the top bit, so I
think this should be emphasised.

Instead of
builder.Append (new string (' ', b-126));
have
builder.Append (new string (' ', (b+2)-128));

Instead of
ms.WriteByte((byte)(spaces+126));
have
ms.WriteByte((byte)(128 + (spaces-2)));
Those make sense, yes.
And to show that this is in the same scheme:
if (t==' ')
{
spaces++;
if (spaces==130)
{
ms.WriteByte(255);
spaces=1;
}
Have
if (t==' ')
{
if (spaces==129)
{
ms.WriteByte(128 + (spaces -2));
spaces=0;
spaces++;
}
I presume you meant the spaces++; to come after the closing brace - and
note that you then need a cast to byte. One alternative would be:

if (t==' ')
{
spaces++;
if (spaces==129) // The most we can write in one byte
{
ms.WriteByte ((byte) (128+(spaces-2)));
spaces=0;
}
}

In other words, writing the spaces a bit more eagerly, and starting
with a clean slate, as it were.
Thus there is a consistency in the use of +/-2 and +/-128. Maybe you'd
even go as far as having a constant for the compressed-spaces indicator, ie
the top bit.
Possibly - although that would imply that it could be changed without
too much problem, whereas it can't - if it's anything *other* than the
top bit, you end up with it conflicting with ASCII. I'd prefer to just
document it carefully.
ps. And now I know how an Encoding looks from the inside. :-)


If you want a slightly more polished example, have a look at
http://www.pobox.com/~skeet/csharp/ebcdic

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #20
Hi Jon,

Effort for saving dashes? Using your code for '-' instead of ' ' reduced
it by c28700.

The corrections to the [if (t==' ') etc]. Yes, rather than copy from the
project (I'd only just got up and VS was still asleep) I copied the code from
the email and hacked away (too hurriedly, lol, I had an appointment to get
to). What you presumed was what I was actually using in my version of the live
code. But keeping the ++ early and writing as-soon-as ('eagerly') - yes, that
is cleaner.

With the constant I said 'maybe' because it's not hugely important. Anyone
working with bytes should recognise a 128 for what it is. But if I did have
the constant, I'd have a comment - top bit only because.. - or some such.

I've looked at the page for your other encoder but I'll leave the
examination for when the time comes. Shame it's for ebcdic - I tend to run
away from anything I can't pronounce, lol.! Lord of the Rings and Dune escaped
being devoured because of that!

Lader, Dude,

Best regards,
Fergus
Nov 15 '05 #21

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

51
by: Alan | last post by:
hi all, I want to define a constant length string, say 4 then in a function at some time, I want to set the string to a constant value, say a below is my code but it fails what is the correct...
0
by: Chris LaJoie | last post by:
I'm looking for some kind of simple string compression code I can use. I'm not looking for SharpZipLib. Their implimentation spans several classes and is very complex. I'm just looking for...
0
by: Robert Ludig | last post by:
How do I bind a textbox to a simple string varaible with databinding? I managed to do the binding but unfortnatedly the textvox does not get updated when I change the string wich the textbox is...
5
by: jeremyje | last post by:
I'm writing some code that will convert a regular string to a byte for compression and then beable to convert that compressed string back into original form. Conceptually I have.... For...
6
by: vedrandekovic | last post by:
Hello, I have one simple string, backspace character question.Here is my example: "HelloBSworld" Should this character "\b" (backspace) in this text return this: "Helloworld"?
3
by: sophia | last post by:
Dear all, the following is the file compression program ,using elimination of spaces, which I saw in a book #include<stdio.h> #include<stdlib.h> int main(int argc,char * argv) {
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.