Compressing data with lots of reoccurring streams of consecutive zero bits

Hi there.

First off, this is only for the fun of it, and I have no illusions that it will ever lead anywhere. I simply want to learn and have fun while doing it.

So, I've written a small program which calculates all primes within the range of 32 bits, using a regular home-brewed sieve, the idea being that nothing more is needed to test values within the 64-bit range.

Every odd number from 1 to 2^32 - 1 is represented in a bitset: true if prime, otherwise false.

The data however takes up 256 MB of memory. That is admittedly hard to do anything about, but compressing the data for storage on disk is of course another matter entirely.

I've been thinking long and hard about the most efficient method of doing so, and have literally spent hours searching for common patterns and the like.

In the end, though, only one approach seems to make sense. The thing is that the data contains a whole lot of long runs of zeros (bits, that is, not characters), all in all taking up around 231 MB of space, spread over some 230 million intervals.

This has to be hugely compressible, but I simply know far too little about the subject and don't really know where to start.

I'm currently considering some form of compound Huffman coding, but I'm not entirely sure it's the right approach, and not sure at all how to implement it. I have however generated a list of runs of consecutive zeros and how often they occur (shown below, with a sketch of the counting loop after the list), and I think it makes sense.

Number of consecutive zeros and number of occurrences:
  2: 22851920|  3: 10204618|  4: 13227107|  5: 17178814
  6:  9537739|  7:  7203001|  8: 13186171|  9:  7297439
 10:  6257057| 11:  9566572| 12:  4581564| 13:  4994988
 14:  9078006| 15:  2892527| 16:  3044808| 17:  5013791
 18:  2373033| 19:  2822346| 20:  4222840| 21:  1740263
 22:  1509880| 23:  2624423| 24:  1504728| 25:  1140228
 26:  1916025| 27:   999131| 28:   862089| 29:  1795026
 30:   569130| 31:   580761| 32:  1076487| 33:   443027
 34:   635301| 35:   651041| 36:   322809| 37:   289021
 38:   554641| 39:   287223| 40:   203909| 41:   425884
 42:   154057| 43:   160181| 44:   325950| 45:   110541
 46:   101330| 47:   179534| 48:    94932| 49:    99012
 50:   132684| 51:    61061| 52:    53934| 53:    93571
 54:    56929| 55:    44838| 56:    69166| 57:    29504
 58:    29007| 59:    62466| 60:    19971| 61:    20185
 62:    38812| 63:    14024| 64:    19315| 65:    24588
 66:    10598| 67:     9393| 68:    18000| 69:    11242
 70:     6786| 71:    11614| 72:     4993| 73:     5492
 74:    11316| 75:     3611| 76:     4497| 77:     6341
 78:     2622| 79:     3059| 80:     4086| 81:     1961
 82:     1732| 83:     3871| 84:     1856| 85:     1232
 86:     2207| 87:     1027| 88:      905| 89:     2006
 90:      784| 91:      686| 92:     1070| 93:      425
 94:      667| 95:      739| 96:      354| 97:      382
 98:      648| 99:      315|100:      232|101:      413
102:      173|103:      170|104:      419|105:      124
106:       99|107:      188|108:       81|109:      114
110:      110|111:       66|112:       56|113:       87
114:       56|115:       32|116:       89|117:       32
118:       34|119:       56|120:       22|121:       18
122:       28|123:       17|124:       15|125:       25
126:        6|127:       10|128:       16|129:        8
130:        8|131:       10|132:       10|133:        5
134:       10|135:        2|136:        2|137:        7
138:        1|139:        4|140:        6|141:        3
142:        2|143:        5|144:        5|145:        3
151:        1|152:        1|154:        1|159:        1
167:        2
When all that is said, I'm mostly just interested in general advice. Have I missed any appropriate compression algorithms? Do you have any ideas or advice? And so on and so forth.


Best regards.
Jun 15 '15 #1

Rabbit
Why store a bitset? Primes are rare; instead of storing, for every number, whether or not it is prime, store just the prime numbers and check against that array of numbers.
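
For example, something along these lines (just a sketch; primes would be a sorted array loaded from disk, and the name is made up):

#include <stddef.h>
#include <stdint.h>

// Sketch: test n for membership in a sorted array of stored primes via binary search.
int isStoredPrime( const uint32_t* primes, size_t count, uint32_t n )
{
    size_t lo = 0, hi = count;
    while( lo < hi )
    {
        size_t mid = lo + ( hi - lo ) / 2;
        if( primes[ mid ] < n )
            lo = mid + 1;
        else if( primes[ mid ] > n )
            hi = mid;
        else
            return 1;   // found: n is prime
    }
    return 0;           // not in the list: n is composite (or out of range)
}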
Jun 15 '15 #2
If I were to store all primes as you suggest, it would amount to 203,280,221 * 4 bytes, or ~775 MB, or twice that if you opt for 64-bit variables. Already that's roughly three (or six) times the size of the bitset. Furthermore, although I cannot be sure, as I know little of such things, I think that the possibility of compressing said data would be vastly reduced.
Also, I believe the approach I'm using to generate the initial range of primes is faster.

In other words, there are more primes in the 32-bit range than you assume. Of course they thin out a bit as we get higher, but using the current representation, I also believe that the efficiency of my compression idea improves the higher we get.

Of course it was never my intention to test and store all primes in the 64-bit range; the testing alone would probably take some 10,000 years. But the idea of creating a piece of software which could actually do it, which I've done, is rather compelling, storage aside: uncompressed, the data would come to 1 exabyte, which few other than Google can muster, so the possibility of compressing it to a point where the average consumer might actually have enough storage space is, not least, what appeals to me.

In the end though, it is of course only a thought experiment.


Now, I'm not a very competent programmer when all is said and done, and I've rather neglected to comment my code (I'll try to quickly do something about that), but you sound like you might be interested in my approach, so I'll post it now and say thanks for your input.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

// A couple of important constants and a few macros that are recommended for working with bitsets in C.
#define WORD_BITS ( 8 * sizeof( uint_fast32_t ) )
#define ARRAY_SIZE ( 0x10000000 / sizeof( uint_fast32_t ) )
#define BITMASK( b ) ( ( uint_fast32_t ) 1 << ( ( b ) % WORD_BITS ) )
#define BITSLOT( b ) ( ( b ) / WORD_BITS )
#define BITSET( a, b ) ( (a)[ BITSLOT( b ) ] |= BITMASK( b ) )
#define BITTEST( a, b ) ( (a)[ BITSLOT( b ) ] & BITMASK( b ) )

// Bitset data: bit i represents the odd number 2 * i + 1.
static uint_fast32_t b[ ARRAY_SIZE ];

// Function to save the data in binary format. All rather straightforward.
int saveData( const char* fn )
{
    FILE* fp = fopen( fn, "wb" );
    if( !fp ) return 0;
    fwrite( b, sizeof( uint_fast32_t ), ARRAY_SIZE, fp );
    fclose( fp );
    return 1;
}

// And of course we also need to load our data.
int loadData( const char* fn )
{
    FILE* fp = fopen( fn, "rb" );
    if( !fp ) return 0;
    size_t read = fread( b, sizeof( uint_fast32_t ), ARRAY_SIZE, fp );
    fclose( fp );
    return read == ARRAY_SIZE;
}

// Generates the initial primes, if that has not already been done.
// Note that we start out with a flipped bitset: set bits are non-primes, clear bits are potential primes.
// Also, we are working exclusively with odd numbers.
void initialize()
{
    // Check whether a data file already exists and load it if so, avoiding a rather demanding initialization.
    if( loadData( "p.bin" ) ) return;
    // The first bit, which represents the number 1, is not prime.
    b[0] = 1;
    // Now we need to find each prime, starting with 3, and iterate through the bitset marking non-primes.
    // Note that we only need to sieve with primes up to 2^16 = sqrt(2^32).
    for( uint_fast32_t n = 1; n < 0x8000; n++ )
    {
        // If not prime, no need to sieve with it.
        if( BITTEST( b, n ) )
            continue;
        // Otherwise mark every odd multiple of p = 2n + 1, starting at 3p; an index step of p is a numeric step of 2p.
        for( uint_fast32_t p = 2 * n + 1, i = p + n; i < 0x80000000; i += p )
            BITSET( b, i );
    }
    // Lastly we flip the bitset, so that set bits now mark primes.
    for( uint_fast32_t n = 0; n < ARRAY_SIZE; n++ )
        b[ n ] ^= UINT_FAST32_MAX;
    // And of course save the data for later use.
    saveData( "p.bin" );
}

// We of course also need the ability to test for primality.
int isPrime( unsigned long long n )
{
    // If n is less than 2, or even (except for the special case of 2), fail.
    if( n < 2 || ( !( n % 2 ) && n != 2 ) )
        return 0;
    // If n is in the 32 bit range, we can simply look it up.
    // Note that bit 0 represents 1, which is not prime, so 2 needs its own check.
    if( n <= 0xffffffff )
        return ( BITTEST( b, ( n - 1 ) / 2 ) || n == 2 ) ? 1 : 0;
    // For values beyond 32 bits, we simply go brute force, trial-dividing by the stored primes.
    // Note again that we need only test divisors up to sqrt(n); hoisting the sqrt also avoids
    // recomputing it every iteration and overflowing a 32 bit counter.
    unsigned long long limit = ( unsigned long long ) sqrt( ( double ) n );
    for( unsigned long long i = 1, p = 3; p <= limit; p = 2 * ++i + 1 )
        // If a prime divisor is found, n is composite.
        if( BITTEST( b, i ) && !( n % p ) )
            return 0;
    // Otherwise the obvious.
    return 1;
}
NB: if some of it seems overly complicated, and I'm sure it will, as I love one-liners, I'll be happy to explain. Suggestions for improvements are of course also welcome, though they are of little concern unless we're talking rather big improvements.
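
For completeness, a minimal sketch of how I drive it (appended to the same file; the test values are just well-known primes and a composite around 2^32):

int main( void )
{
    initialize();                               // builds the sieve on first run, loads p.bin afterwards
    printf( "%d\n", isPrime( 2 ) );             // 1
    printf( "%d\n", isPrime( 4294967291ULL ) ); // 1: the largest prime below 2^32, via table lookup
    printf( "%d\n", isPrime( 4294967311ULL ) ); // 1: the smallest prime above 2^32, via trial division
    printf( "%d\n", isPrime( 4294967297ULL ) ); // 0: 2^32 + 1 = 641 * 6700417
    return 0;
}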
Jun 15 '15 #3
Rabbit
You're right, they are more common than I thought. Huffman compression is certainly a possibility, though it's a bit more complex than others. It all depends on how much compression you want to achieve. I've used the LZW algorithm before because it's easier to understand.

~200 MB doesn't seem that big, though. Hard drive space usually isn't much of an issue anymore, and with compression comes the overhead of decompressing the data whenever you want to use it.
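
The core of an LZW encoder is surprisingly small. Here is a rough sketch (not the code from my article): fixed 16-bit output codes and a linear dictionary search for clarity, so it is slow and stops growing the dictionary at 65,536 entries:

#include <stddef.h>
#include <stdint.h>

#define MAX_CODES 65536

static int     dictParent[ MAX_CODES ];  // code of the prefix string
static uint8_t dictChar[ MAX_CODES ];    // byte appended to that prefix
static int     dictSize;

// Linear search for the string (prefix + c) in the dictionary; -1 if absent.
static int findCode( int prefix, uint8_t c )
{
    for( int i = 256; i < dictSize; i++ )
        if( dictParent[ i ] == prefix && dictChar[ i ] == c )
            return i;
    return -1;
}

// Compress n input bytes into 16-bit codes; returns the number of codes written.
// out must have room for n codes in the worst case.
size_t lzwCompress( const uint8_t* in, size_t n, uint16_t* out )
{
    if( n == 0 ) return 0;
    dictSize = 256;                       // codes 0..255 are the raw bytes themselves
    size_t m = 0;
    int prefix = in[ 0 ];
    for( size_t i = 1; i < n; i++ )
    {
        int code = findCode( prefix, in[ i ] );
        if( code >= 0 )
            prefix = code;                // keep extending the current match
        else
        {
            out[ m++ ] = ( uint16_t ) prefix;
            if( dictSize < MAX_CODES )    // remember (match + next byte) as a new code
            {
                dictParent[ dictSize ] = prefix;
                dictChar[ dictSize ] = in[ i ];
                dictSize++;
            }
            prefix = in[ i ];
        }
    }
    out[ m++ ] = ( uint16_t ) prefix;     // flush the final match
    return m;
}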
Jun 15 '15 #4
You are of course right. If all I wanted was a command line utility to test large primes, I could certainly live with the overhead of 256 MB, but of course I can't stop there. I have to see how far I can take it: how efficient I can make the process, how much I can compress the data, and in the end how many primes I can actually store in a reasonable amount of space.

It's nothing but a challenge, like climbing Mount Everest. There's really not much to do at the top, breathing is difficult at best, and you risk life and limb getting there, yet many attempt it.

Anyway, I'll check out that LZW algorithm. I don't know the first thing about it, but I'm sure it'll be intriguing.


Best regards.
Jun 15 '15 #5
Okay, so I think I've finally come up with the perfect compression method for this specific type of data, at least for a first pass. It is perfectly possible, I think, that a second pass with some other type of compression algorithm could also be beneficial, but that's not really relevant at this point.

Firstly, the uncompressed data will be split up into 256 MB segments. These segments could be a different size, but 256 MB is convenient.

At the very beginning of the compressed data we will obviously need an index, which should not be too complicated. At the start of each compressed segment we will need a 32-bit header: 28 bits tell us the compressed size of the segment, and 4 bits tell us about the small variation in the encoding that may occur from segment to segment.

The actual compression is rather ingenious, I think. Basically, we look for binary patterns of the form of a 1 followed by some minimum number of 0s, terminated by a 1 (or the end of the segment), where the final 1 is not actually part of the pattern but tells us where it ends.

If said pattern is at least a given length, we simply replace it with a key of the form:

0111 b0 b1

0111 is the actual key we search for when we decompress; as that sequence can occur only once, at the very beginning of the very first segment, it's perfect for the task.

The width of b0 is determined by the 4-bit part of our header; it will of course not be wider than necessary, but it does need to be as wide as necessary.

The width of b1 is variable and determined by b0.

Thus if the pattern we wish to compress is for example:

1000 0000 0000 0000

our key might look something like:

0111 011 001

where b0 has a width of 3 bits and, for a 16-bit pattern, oddly enough also equals 3; thus b1 also has a width of 3 bits, but equals 1.

It may not be obvious why this is, but if we imagine the shortest possible key, with the given width of b0 equal to 3:

0111 000

it is easy to see that the shortest pattern we can actually compress is 8 bits.

for 9 bits the key would then look like
0111 001 0
for 10
0111 001 1
for 11
0111 010 00
and so on, until we reach the key
0111 011 001
for our 16-bit pattern, which for that specific pattern saves us 6 bits, or 37.5%.

Now the real genius is of course that the wider the pattern, the more compression we get, and in later segments the long patterns become more common relative to the shorter, barely compressible ones, so the algorithm becomes even more efficient.


Now, I'm not sure I explained all that very well, and I certainly don't have more than a vague idea of how to implement the full pipeline.
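
Still, to make the key format concrete, here is a rough sketch of how a single key might be emitted. Everything here is made up for illustration (the bit writer, the names, and a b0 field of B0 bits), and it assumes the mapping implied by my example above, i.e. a pattern of total length L is encoded as L = 7 + 2^m + b1, with b0 = m:

#include <stddef.h>
#include <stdint.h>

// Hypothetical bit writer: appends bits MSB-first into a byte buffer the caller has zeroed.
typedef struct { uint8_t* buf; size_t pos; } BitWriter;

static void putBit( BitWriter* w, int bit )
{
    if( bit )
        w->buf[ w->pos / 8 ] |= ( uint8_t ) ( 0x80 >> ( w->pos % 8 ) );
    w->pos++;
}

static void putBits( BitWriter* w, uint32_t v, int width )
{
    for( int i = width - 1; i >= 0; i-- )
        putBit( w, ( v >> i ) & 1 );
}

// Emit the key for a pattern of total length L (a one followed by L - 1 zeros),
// using a fixed b0 field of B0 bits. Valid for L >= 8.
void emitKey( BitWriter* w, uint32_t L, int B0 )
{
    uint32_t r = L - 7;
    int m = 0;
    while( ( uint32_t ) 1 << ( m + 1 ) <= r )   // m = floor(log2(L - 7))
        m++;
    uint32_t b1 = r - ( ( uint32_t ) 1 << m );  // remainder within the bracket
    putBits( w, 0x7, 4 );                       // the marker 0111
    putBits( w, ( uint32_t ) m, B0 );           // b0: the width of b1
    putBits( w, b1, m );                        // b1: m bits (no bits at all when m == 0)
}

For L = 16 and B0 = 3 this produces exactly the 0111 011 001 from my example, and for L = 8 it produces the shortest key, 0111 000.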

In any case, I'm very interested to hear any thoughts and/or ideas, or even criticism, because I must admit I am quite pleased with myself, and perhaps that's not really warranted.


Best regards.
Jun 17 '15 #6
Rabbit
I can't say I fully understand your compression scheme. But I pulled up my old LZW implementation and created an article from it.
http://bytes.com/topic/access/insigh...pt#post3792741
Jun 17 '15 #7
LZW does well, but I think my approach can do better, though of course only for specialized purposes.

Nevertheless, I'll take a look at that article. I'm sure it will make for interesting reading.
Jun 17 '15 #8
