Compressing data with lots of reoccurring streams of consecutive zero bits

Hi there.

First off, this is only for the fun of it, and I have no illusions that it will ever lead anywhere. I simply want to learn and have fun while doing it.

So, I've written a small program which calculates all primes within the range of 32 bits, using a regular home-brewed sieve, the idea being that nothing more is needed to test values within the 64-bit range.

Every odd number from 1 to 2^32 - 1 is represented in a bitset: true if prime, otherwise false.

The data however takes up 256 MB of memory. That is admittedly hard to do anything about, but compressing the data for storage on disk is of course another matter entirely.

I've been thinking long and hard about the most efficient method of doing so, and have literally spent hours searching for common patterns and the like.

In the end, though, only one approach seems to make sense. The thing is that the data contains a whole lot of long runs of zeros (bits, that is, not characters), all in all taking up around 231 MB of space, spread over some 230 million intervals.

This has to be hugely compressible, but I simply know far too little about the subject and don't really know where to start.

I'm currently considering some form of compound Huffman coding, but I'm not entirely sure it's the right approach, and not sure at all how to implement it. I have however generated a list of runs of consecutive zeros and how often they occur (shown below, with a sketch of the counting loop after the list), and I think it makes sense.

Number of consecutive zeros and number of occurrences:
  2: 22851920|  3: 10204618|  4: 13227107|  5: 17178814
  6:  9537739|  7:  7203001|  8: 13186171|  9:  7297439
 10:  6257057| 11:  9566572| 12:  4581564| 13:  4994988
 14:  9078006| 15:  2892527| 16:  3044808| 17:  5013791
 18:  2373033| 19:  2822346| 20:  4222840| 21:  1740263
 22:  1509880| 23:  2624423| 24:  1504728| 25:  1140228
 26:  1916025| 27:   999131| 28:   862089| 29:  1795026
 30:   569130| 31:   580761| 32:  1076487| 33:   443027
 34:   635301| 35:   651041| 36:   322809| 37:   289021
 38:   554641| 39:   287223| 40:   203909| 41:   425884
 42:   154057| 43:   160181| 44:   325950| 45:   110541
 46:   101330| 47:   179534| 48:    94932| 49:    99012
 50:   132684| 51:    61061| 52:    53934| 53:    93571
 54:    56929| 55:    44838| 56:    69166| 57:    29504
 58:    29007| 59:    62466| 60:    19971| 61:    20185
 62:    38812| 63:    14024| 64:    19315| 65:    24588
 66:    10598| 67:     9393| 68:    18000| 69:    11242
 70:     6786| 71:    11614| 72:     4993| 73:     5492
 74:    11316| 75:     3611| 76:     4497| 77:     6341
 78:     2622| 79:     3059| 80:     4086| 81:     1961
 82:     1732| 83:     3871| 84:     1856| 85:     1232
 86:     2207| 87:     1027| 88:      905| 89:     2006
 90:      784| 91:      686| 92:     1070| 93:      425
 94:      667| 95:      739| 96:      354| 97:      382
 98:      648| 99:      315|100:      232|101:      413
102:      173|103:      170|104:      419|105:      124
106:       99|107:      188|108:       81|109:      114
110:      110|111:       66|112:       56|113:       87
114:       56|115:       32|116:       89|117:       32
118:       34|119:       56|120:       22|121:       18
122:       28|123:       17|124:       15|125:       25
126:        6|127:       10|128:       16|129:        8
130:        8|131:       10|132:       10|133:        5
134:       10|135:        2|136:        2|137:        7
138:        1|139:        4|140:        6|141:        3
142:        2|143:        5|144:        5|145:        3
151:        1|152:        1|154:        1|159:        1
167:        2
When all that is said, I'm mostly just interested in general advice. Have I missed any appropriate compression algorithms? Do you have any ideas or advice? And so on and so forth.


Best regards.
Jun 15 '15 #1

Rabbit
Why store a bitset? Primes are rare; instead of storing, for every number, whether or not it is prime, store just the prime numbers and check against that array of numbers.
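
For example, something along these lines (just a sketch; primes would be a sorted array loaded from disk, and the name is made up):

#include <stddef.h>
#include <stdint.h>

// Sketch: test n for membership in a sorted array of stored primes via binary search.
int isStoredPrime( const uint32_t* primes, size_t count, uint32_t n )
{
    size_t lo = 0, hi = count;
    while( lo < hi )
    {
        size_t mid = lo + ( hi - lo ) / 2;
        if( primes[ mid ] < n )
            lo = mid + 1;
        else if( primes[ mid ] > n )
            hi = mid;
        else
            return 1;   // found: n is prime
    }
    return 0;           // not in the list: n is composite (or out of range)
}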
Jun 15 '15 #2
If I were to store all primes as you suggest, it would amount to 203,280,221 * 4 bytes, or ~775 MB, or twice that if you opt for 64-bit variables. Already that's roughly three (or six) times the size of the bitset. Furthermore, although I cannot be sure, as I know little of such things, I think that the possibility of compressing said data would be vastly reduced.
Also, I believe the approach I'm using to generate the initial range of primes is faster.

In other words, there are more primes in the 32-bit range than you assume. Of course they thin out a bit as we get higher, but using the current representation, I also believe that the efficiency of my compression idea improves the higher we get.

Of course it was never my intention to test and store all primes in the 64-bit range; the testing alone would probably take some 10,000 years. But the idea of creating a piece of software which could actually do it, which I've done, is rather compelling, storage aside: uncompressed, the data would come to 1 exabyte, which few other than Google can muster, so the possibility of compressing it to a point where the average consumer might actually have enough storage space is, not least, what appeals to me.

In the end though, it is of course only a thought experiment.


Now, I'm not a very competent programmer when all is said and done, and I've rather neglected to comment my code (I'll try to quickly do something about that), but you sound like you might be interested in my approach, so I'll post it now and say thanks for your input.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

// A couple of important constants and a few macros that are recommended for working with bitsets in C.
#define WORD_BITS ( 8 * sizeof( uint_fast32_t ) )
#define ARRAY_SIZE ( 0x10000000 / sizeof( uint_fast32_t ) )
#define BITMASK( b ) ( ( uint_fast32_t ) 1 << ( ( b ) % WORD_BITS ) )
#define BITSLOT( b ) ( ( b ) / WORD_BITS )
#define BITSET( a, b ) ( (a)[ BITSLOT( b ) ] |= BITMASK( b ) )
#define BITTEST( a, b ) ( (a)[ BITSLOT( b ) ] & BITMASK( b ) )

// Bitset data: bit i represents the odd number 2 * i + 1.
static uint_fast32_t b[ ARRAY_SIZE ];

// Function to save the data in binary format. All rather straightforward.
int saveData( const char* fn )
{
    FILE* fp = fopen( fn, "wb" );
    if( !fp ) return 0;
    fwrite( b, sizeof( uint_fast32_t ), ARRAY_SIZE, fp );
    fclose( fp );
    return 1;
}

// And of course we also need to load our data.
int loadData( const char* fn )
{
    FILE* fp = fopen( fn, "rb" );
    if( !fp ) return 0;
    size_t read = fread( b, sizeof( uint_fast32_t ), ARRAY_SIZE, fp );
    fclose( fp );
    return read == ARRAY_SIZE;
}

// Generates the initial primes, if that has not already been done.
// Note that we start out with a flipped bitset: set bits are non-primes, clear bits are potential primes.
// Also, we are working exclusively with odd numbers.
void initialize()
{
    // Check whether a data file already exists and load it if so, avoiding a rather demanding initialization.
    if( loadData( "p.bin" ) ) return;
    // The first bit, which represents the number 1, is not prime.
    b[0] = 1;
    // Now we need to find each prime, starting with 3, and iterate through the bitset marking non-primes.
    // Note that we only need to sieve with primes up to 2^16 = sqrt(2^32).
    for( uint_fast32_t n = 1; n < 0x8000; n++ )
    {
        // If not prime, no need to sieve with it.
        if( BITTEST( b, n ) )
            continue;
        // Otherwise mark every odd multiple of p = 2n + 1, starting at 3p; an index step of p is a numeric step of 2p.
        for( uint_fast32_t p = 2 * n + 1, i = p + n; i < 0x80000000; i += p )
            BITSET( b, i );
    }
    // Lastly we flip the bitset, so that set bits now mark primes.
    for( uint_fast32_t n = 0; n < ARRAY_SIZE; n++ )
        b[ n ] ^= UINT_FAST32_MAX;
    // And of course save the data for later use.
    saveData( "p.bin" );
}

// We of course also need the ability to test for primality.
int isPrime( unsigned long long n )
{
    // If n is less than 2, or even (except for the special case of 2), fail.
    if( n < 2 || ( !( n % 2 ) && n != 2 ) )
        return 0;
    // If n is in the 32 bit range, we can simply look it up.
    // Note that bit 0 represents 1, which is not prime, so 2 needs its own check.
    if( n <= 0xffffffff )
        return ( BITTEST( b, ( n - 1 ) / 2 ) || n == 2 ) ? 1 : 0;
    // For values beyond 32 bits, we simply go brute force, trial-dividing by the stored primes.
    // Note again that we need only test divisors up to sqrt(n); hoisting the sqrt also avoids
    // recomputing it every iteration and overflowing a 32 bit counter.
    unsigned long long limit = ( unsigned long long ) sqrt( ( double ) n );
    for( unsigned long long i = 1, p = 3; p <= limit; p = 2 * ++i + 1 )
        // If a prime divisor is found, n is composite.
        if( BITTEST( b, i ) && !( n % p ) )
            return 0;
    // Otherwise the obvious.
    return 1;
}
NB: if some of it seems overly complicated, and I'm sure it will, as I love one-liners, I'll be happy to explain. Suggestions for improvements are of course also welcome, though they are of little concern unless we're talking rather big improvements.
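
For completeness, a minimal sketch of how I drive it (appended to the same file; the test values are just well-known primes and a composite around 2^32):

int main( void )
{
    initialize();                               // builds the sieve on first run, loads p.bin afterwards
    printf( "%d\n", isPrime( 2 ) );             // 1
    printf( "%d\n", isPrime( 4294967291ULL ) ); // 1: the largest prime below 2^32, via table lookup
    printf( "%d\n", isPrime( 4294967311ULL ) ); // 1: the smallest prime above 2^32, via trial division
    printf( "%d\n", isPrime( 4294967297ULL ) ); // 0: 2^32 + 1 = 641 * 6700417
    return 0;
}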
Jun 15 '15 #3
Rabbit
You're right, they are more common than I thought. Huffman compression is certainly a possibility, though it's a bit more complex than others. It all depends on how much compression you want to achieve. I've used the LZW algorithm before because it's easier to understand.

~200 MB doesn't seem that big, though. Hard drive space usually isn't much of an issue anymore, and with compression comes the overhead of decompressing the data whenever you want to use it.
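
The core of an LZW encoder is surprisingly small. Here is a rough sketch (not the code from my article): fixed 16-bit output codes and a linear dictionary search for clarity, so it is slow and stops growing the dictionary at 65,536 entries:

#include <stddef.h>
#include <stdint.h>

#define MAX_CODES 65536

static int     dictParent[ MAX_CODES ];  // code of the prefix string
static uint8_t dictChar[ MAX_CODES ];    // byte appended to that prefix
static int     dictSize;

// Linear search for the string (prefix + c) in the dictionary; -1 if absent.
static int findCode( int prefix, uint8_t c )
{
    for( int i = 256; i < dictSize; i++ )
        if( dictParent[ i ] == prefix && dictChar[ i ] == c )
            return i;
    return -1;
}

// Compress n input bytes into 16-bit codes; returns the number of codes written.
// out must have room for n codes in the worst case.
size_t lzwCompress( const uint8_t* in, size_t n, uint16_t* out )
{
    if( n == 0 ) return 0;
    dictSize = 256;                       // codes 0..255 are the raw bytes themselves
    size_t m = 0;
    int prefix = in[ 0 ];
    for( size_t i = 1; i < n; i++ )
    {
        int code = findCode( prefix, in[ i ] );
        if( code >= 0 )
            prefix = code;                // keep extending the current match
        else
        {
            out[ m++ ] = ( uint16_t ) prefix;
            if( dictSize < MAX_CODES )    // remember (match + next byte) as a new code
            {
                dictParent[ dictSize ] = prefix;
                dictChar[ dictSize ] = in[ i ];
                dictSize++;
            }
            prefix = in[ i ];
        }
    }
    out[ m++ ] = ( uint16_t ) prefix;     // flush the final match
    return m;
}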
Jun 15 '15 #4
You are of course right. If all I wanted was a command line utility to test large primes, I could certainly live with the overhead of 256 MB, but of course I can't stop there. I have to see how far I can take it: how efficient I can make the process, how much I can compress the data, and in the end how many primes I can actually store in a reasonable amount of space.

It's nothing but a challenge, like climbing Mount Everest. There's really not much to do at the top, breathing is difficult at best, and you risk life and limb getting there, yet many attempt it.

Anyway, I'll check out that LZW algorithm. I don't know the first thing about it, but I'm sure it'll be intriguing.


Best regards.
Jun 15 '15 #5
Okay, so I think I've finally come up with the perfect compression method for this specific type of data, at least for a first pass. It is perfectly possible, I think, that a second pass with some other type of compression algorithm could also be beneficial, but that's not really relevant at this point.

Firstly, the uncompressed data will be split up into 256 MB segments. These segments could be a different size, but 256 MB is convenient.

At the very beginning of the compressed data we will obviously need an index, which should not be too complicated. At the start of each compressed segment we will need a 32-bit header: 28 bits tell us the compressed size of the segment, and 4 bits tell us about the small variation in the encoding that may occur from segment to segment.

The actual compression is rather ingenious, I think. Basically, we look for binary patterns of the form of a 1 followed by some minimum number of 0s, terminated by a 1 (or the end of the segment), where the final 1 is not actually part of the pattern but tells us where it ends.

If said pattern is at least a given length, we simply replace it with a key of the form:

0111 b0 b1

0111 is the actual key we search for when we decompress; as that sequence can occur only once, at the very beginning of the very first segment, it's perfect for the task.

The width of b0 is determined by the 4-bit part of our header; it will of course not be wider than necessary, but it does need to be as wide as necessary.

The width of b1 is variable and determined by b0.

Thus if the pattern we wish to compress is for example:

1000 0000 0000 0000

our key might look something like:

0111 011 001

where b0 has a width of 3 bits and, for a 16-bit pattern, oddly enough also equals 3; thus b1 also has a width of 3 bits, but equals 1.

It may not be obvious why this is, but if we imagine the shortest possible key, with the given width of b0 equal to 3:

0111 000

it is easy to see that the shortest pattern we can actually compress is 8 bits.

for 9 bits the key would then look like
0111 001 0
for 10
0111 001 1
for 11
0111 010 00
and so on, until we reach the key
0111 011 001
for our 16-bit pattern, which for that specific pattern saves us 6 bits, or 37.5%.

Now the real genius is of course that the wider the pattern, the more compression we get, and in later segments the long patterns become more common relative to the shorter, barely compressible ones, so the algorithm becomes even more efficient.


Now, I'm not sure I explained all that very well, and I certainly don't have more than a vague idea of how to implement the full pipeline.
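
Still, to make the key format concrete, here is a rough sketch of how a single key might be emitted. Everything here is made up for illustration (the bit writer, the names, and a b0 field of B0 bits), and it assumes the mapping implied by my example above, i.e. a pattern of total length L is encoded as L = 7 + 2^m + b1, with b0 = m:

#include <stddef.h>
#include <stdint.h>

// Hypothetical bit writer: appends bits MSB-first into a byte buffer the caller has zeroed.
typedef struct { uint8_t* buf; size_t pos; } BitWriter;

static void putBit( BitWriter* w, int bit )
{
    if( bit )
        w->buf[ w->pos / 8 ] |= ( uint8_t ) ( 0x80 >> ( w->pos % 8 ) );
    w->pos++;
}

static void putBits( BitWriter* w, uint32_t v, int width )
{
    for( int i = width - 1; i >= 0; i-- )
        putBit( w, ( v >> i ) & 1 );
}

// Emit the key for a pattern of total length L (a one followed by L - 1 zeros),
// using a fixed b0 field of B0 bits. Valid for L >= 8.
void emitKey( BitWriter* w, uint32_t L, int B0 )
{
    uint32_t r = L - 7;
    int m = 0;
    while( ( uint32_t ) 1 << ( m + 1 ) <= r )   // m = floor(log2(L - 7))
        m++;
    uint32_t b1 = r - ( ( uint32_t ) 1 << m );  // remainder within the bracket
    putBits( w, 0x7, 4 );                       // the marker 0111
    putBits( w, ( uint32_t ) m, B0 );           // b0: the width of b1
    putBits( w, b1, m );                        // b1: m bits (no bits at all when m == 0)
}

For L = 16 and B0 = 3 this produces exactly the 0111 011 001 from my example, and for L = 8 it produces the shortest key, 0111 000.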

In any case, I'm very interested to hear any thoughts and/or ideas, or even criticism, because I must admit I am quite pleased with myself, and perhaps that's not really warranted.


Best regards.
Jun 17 '15 #6
Rabbit
I can't say I fully understand your compression scheme. But I pulled up my old LZW implementation and created an article from it.
http://bytes.com/topic/access/insigh...pt#post3792741
Jun 17 '15 #7
LZW does well, but I think my approach can do better, though of course only for specialized purposes.

Nevertheless, I'll take a look at that article. I'm sure it will make for interesting reading.
Jun 17 '15 #8
