
bitcounting on 30000+ bits, but not byte aligned

I have an array that contains 30000+ bits. The size varies at
runtime. I need to move through this chunk of memory and count bits
every so often. Example: the first 9 bits have 2 ones, the next 10
bits have 4 ones, the next 9 bits have 8 ones ... I only know the
number of bits that need to be counted at runtime. But I do know it
will be two different lengths that are consecutive numbers; it could
be just one length, but that will be rare. I know the pattern the
lengths come in at runtime. The lengths will normally be around 32.
I am using this to scale down large 1-bit images using a box filter.
The different lengths come from a DDA that creates a filter that
evenly spreads the small box size and large box size filters over the
image. So I will actually do this same bit counting 30000+ times,
depending on the height of the image. Speed is very important. I
don't need asm, though, since it will run on different platforms. I
think there should be some way to precompute a lot of this, since all
the rows will have the same pattern of bit lengths that I need to
count. I know about using a precomputed table for counting bits in 1
byte, and I am sure that will be part of it. But what else can I
precompute to save time? Thank you.
Jul 22 '05 #1
10 Replies


Eric 2004-06-24 :
[Eric's original question quoted in full]


I guess you would be more on-topic in comp.graphics.algorithms,
although I would try to explain the "lengths" a bit better if I were
you; I didn't get it, although I have done 1-bit image processing
professionally. All that comes to mind is: if the all-zeroes and/or
all-ones cases are frequent, handle them with an if before doing
anything else.
BTW, I guess you know this trick for counting bits:
bits = 0;
while (x) {
    x &= x - 1; /* clears the lowest set bit of x */
    bits++;
}

Walter Tross
Jul 22 '05 #2

CFG
Why not just use a lookup table?

bits = lut[x]; // lut contains 256 elements and x is a byte

instead of

while (x) {
    x &= x - 1;
    bits++;
}
Jul 22 '05 #3

CFG 2004-06-25 :
Why not just use a lookup table?

bits = lut[x]; // lut contains 256 elements and x is a byte

instead of

while (x) {
    x &= x - 1;
    bits++;
}


Maybe I was not clear enough. I said "BTW", and I meant the trick to
be used mainly to build the lookup table. It can also be used instead
of the lookup table when ones are known to be very rare (it
practically includes the if (!x) test, which I would put in front of
the rest if ones are less rare).
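For instance, the trick could fill the 256-entry table once at
startup. A minimal sketch (the table and function names are
illustrative, not from the thread):

```c
#include <stddef.h>

unsigned char bit_count[256];

/* Fill the 256-entry popcount table once at startup, using the
   x &= x - 1 trick to count the ones in each possible byte value. */
void init_bit_count(void)
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned x = v, bits = 0;
        while (x) {
            x &= x - 1; /* clears the lowest set bit */
            bits++;
        }
        bit_count[v] = (unsigned char)bits;
    }
}
```

After that, bits = bit_count[x] replaces the loop in the inner code.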

Walter Tross
Jul 22 '05 #4

In article <a6************************@posting.google.com>,
Eric <em*****@myrealbox.com> wrote:
[Eric's original question quoted in full]


You already mention using a precomputed table, which is a very good
start. The only other thing I can think of is if your range doesn't
start or stop on an even byte boundary. If that's the case, you will
also want a small table of masks to AND with the byte prior to using
the lookup table. Some quick pseudo code would be:

handle 1st byte, which is potentially partial.
handle all full bytes.
handle last byte, which is also potentially partial.

The 1st and last bytes would be handled by ANDing them prior to using
the lookup table.
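The three-step scheme above might be sketched like this, assuming
MSB-first bit order within a byte (common for 1-bit image rows); the
table and function names are illustrative:

```c
#include <stddef.h>

/* 256-entry popcount table; filled once by init_bit_count(). */
unsigned char bit_count[256];

void init_bit_count(void)
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned x = v, bits = 0;
        while (x) { x &= x - 1; bits++; }
        bit_count[v] = (unsigned char)bits;
    }
}

/* Count the ones in bits [start, start+len), len > 0, bit 0 being the
   MSB of buf[0]: mask the partial first byte, sum the full bytes,
   mask the partial last byte. */
size_t count_range(const unsigned char *buf, size_t start, size_t len)
{
    size_t end = start + len;
    size_t first = start / 8, last = (end - 1) / 8;
    /* drop the bits before 'start' in the first byte */
    unsigned char head = (unsigned char)(0xFFu >> (start % 8));
    /* drop the bits at or after 'end' in the last byte */
    unsigned char tail = (unsigned char)(0xFFu << ((8 - end % 8) % 8));

    if (first == last)
        return bit_count[buf[first] & head & tail];

    size_t count = bit_count[buf[first] & head];
    for (size_t i = first + 1; i < last; i++)
        count += bit_count[buf[i]];
    return count + bit_count[buf[last] & tail];
}
```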

Also if you have a reasonably small upper limit to the length you're
checking, you may wish to handle the group of whole bytes by using
a switch statement. Something like this:

switch(number_of_bytes) {
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
}
Jul 22 '05 #5

CFG
Also if you have a reasonably small upper limit to the length you're
checking, you may wish to handle the group of whole bytes by using
a switch statement. Something like this:

switch(number_of_bytes) {
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
}

I would rewrite this switch as follows:
switch(number_of_bytes) {
case 8: count += bit_count[p[7]];
case 7: count += bit_count[p[6]];
case 6: count += bit_count[p[5]];
case 5: count += bit_count[p[4]];
case 4: count += bit_count[p[3]];
case 3: count += bit_count[p[2]];
case 2: count += bit_count[p[1]];
case 1: count += bit_count[p[0]];
}
p+= 8;

Jul 22 '05 #6

In article <10***************@news.pubnix.net>, CFG <cf@fg.com> wrote:

I would rewrite this switch as follows:
switch(number_of_bytes) {
case 8: count += bit_count[p[7]];
case 7: count += bit_count[p[6]];
case 6: count += bit_count[p[5]];
case 5: count += bit_count[p[4]];
case 4: count += bit_count[p[3]];
case 3: count += bit_count[p[2]];
case 2: count += bit_count[p[1]];
case 1: count += bit_count[p[0]];
}
p+= 8;

^^^^^^

p += number_of_bytes;
Jul 22 '05 #7

jd*@smof.fiawol.org (John Cochran) wrote in message news:<cb**********@smof.fiawol.org>...
[John's advice about edge masks and the three-step scheme, quoted]
This is exactly what I was thinking of. I think I will also need a
table of offsets from the start of the array to tell me the byte each
group starts on, because most of the time the last byte will be
partial, causing the next group to start in the same byte, but
sometimes it will land evenly on a byte boundary, causing the next
group to start in the next byte.

Also if you have a reasonably small upper limit to the length you're
checking, you may wish to handle the group of whole bytes by using
a switch statement. Something like this:
I hadn't thought of doing this. Thanks. I may use this along with
adding a default: that can handle any number of bytes. That way I get
the speedup for the most common cases, but I don't break if I need to
use a longer group of whole bytes. I could even build a 2-byte table
and use it along with my 1-byte table. I would need to test that,
though. The 2-byte table may cause too many cache misses. I am
trying to keep all the lookup tables and the array of bits in cache.
Glad L2 cache is so big now. Should I try to keep it in L1, or is L2
fine? I need to find that out. The array of bits probably doesn't
get much benefit from being in cache, though, since most of the bytes
are only read once. Although the memory address doesn't change.

switch(number_of_bytes) {
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
}


I don't really see how this will count the bits though. Say
number_of_bytes = 4; then count += bit_count[*p++], so that gives me
the bitcount of one byte and you exit the switch statement. Should it
not be written as:
switch(number_of_bytes) {
case 8: count = bit_count[*p++] + bit_count[*p++] + ...;
case 7: count = bit_count[*p++] + bit_count[*p++] + ...;
case 6: count = bit_count[*p++] + bit_count[*p++] + ...;
case 5: count = bit_count[*p++] + bit_count[*p++] + ...;
case 4: count = bit_count[*p++] + bit_count[*p++] + ...;
case 3: count = bit_count[*p++] + bit_count[*p++] + ...;
case 2: count = bit_count[*p++] + bit_count[*p++];
case 1: count = bit_count[*p++];
}
Jul 22 '05 #8

In article <a6*************************@posting.google.com> ,
Eric <em*****@myrealbox.com> wrote:
SNIP...

[the switch example and Eric's proposed rewrite, quoted]

Nope. You forgot about the unique behavior of switch statements in C
and C++, namely that their default behavior is to fall through to the
next case unless a "break" statement is encountered. Notice that the
example I gave you doesn't have any break statements. Also notice
that the order of the case statements is descending. Your idea of
having a default clause is a good one. The method that I would use
would be:

switch(number_of_bytes) {
default:
// Loop to handle many bytes....
break;
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
case 0: ;
}

Notice the additional case 0 and the fact that I placed the default
FIRST. The reason for the added case 0 is that in the first example,
without a default, the 0 case did no work. In the new version, the
default does work, and I don't see any good reason to have the zero
case tested twice (once by the switch, again by the loop in the
default case). The reason I placed the default case first is to
allow the fast path (cases 0 through 8) to possibly fall off the end
of the switch statement and avoid an unneeded jump opcode in the
compiled version. The default path will have a jump opcode to skip
past the other cases, but since it's the slow path anyway, an extra
jump shouldn't be noticed.

Come to think of it, if your loop in the default case processes only
(number_of_bytes - 8) bytes, you can get rid of the break statement
and allow it to fall through to the rest of the switch statement.
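That last variant, written out as a complete function, might look
like this (a sketch; the function name is illustrative, and note the
+= so the fall-through accumulates):

```c
#include <stddef.h>

unsigned char bit_count[256];

void init_bit_count(void)
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned x = v, bits = 0;
        while (x) { x &= x - 1; bits++; }
        bit_count[v] = (unsigned char)bits;
    }
}

/* Sum the popcounts of number_of_bytes bytes starting at p.  The
   default case handles everything beyond the first 8 bytes and then
   falls through into the unrolled descending cases. */
size_t count_bytes(const unsigned char *p, size_t number_of_bytes)
{
    size_t count = 0;
    switch (number_of_bytes) {
    default:
        for (size_t i = 8; i < number_of_bytes; i++)
            count += bit_count[*p++];
        /* fall through */
    case 8: count += bit_count[*p++]; /* fall through */
    case 7: count += bit_count[*p++]; /* fall through */
    case 6: count += bit_count[*p++]; /* fall through */
    case 5: count += bit_count[*p++]; /* fall through */
    case 4: count += bit_count[*p++]; /* fall through */
    case 3: count += bit_count[*p++]; /* fall through */
    case 2: count += bit_count[*p++]; /* fall through */
    case 1: count += bit_count[*p++]; /* fall through */
    case 0: ;
    }
    return count;
}
```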
Jul 22 '05 #9

CFG
My bad :-)
Jul 22 '05 #10

Eric wrote:
[Eric's original question quoted in full]


So you might get something like:
32 33 33 33 32 32 33 33 32 ...
but it would be the same for every row?
Is there an upper limit on the length of a run?

You might try doing it sideways, but that wouldn't play very nicely with
the cache. If you could flip the bitmap on its diagonal and interleave
things in blocks of unsigned long, you could cut your passes by a factor
of sizeof(unsigned long)*CHAR_BIT. But moving all the data around would
probably negate any benefit you'd get from that.

You might run the DDA once and store each position as a byte
increment and a mask:

struct RunRec
{
    unsigned cbAdvance;
    std::vector<unsigned char> Mask;
};

Then at each step, advance the pointer by cbAdvance bytes, AND the
value there with the Mask, and count how many bits you have. Then you
don't need bit shifting for every row. (Of course,
std::vector<unsigned char> is probably not the best choice...)
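A plain-C rendition of that idea (the sketch above uses std::vector)
might look like this; all names are illustrative, and the popcount
table is assumed to be built once elsewhere in the program:

```c
#include <stddef.h>

unsigned char popcount8[256];

void init_popcount8(void)
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned x = v, bits = 0;
        while (x) { x &= x - 1; bits++; }
        popcount8[v] = (unsigned char)bits;
    }
}

/* One DDA step, stored once: a byte advance plus one mask byte per
   covered byte, so the per-row loop only advances, masks, and does
   table lookups -- no per-row bit shifting. */
struct run_rec {
    size_t advance;            /* bytes to advance before this run */
    size_t nmask;              /* number of mask bytes that follow */
    const unsigned char *mask; /* one mask byte per covered byte   */
};

size_t count_run(const unsigned char **pp, const struct run_rec *r)
{
    const unsigned char *p = *pp + r->advance;
    size_t count = 0;
    for (size_t i = 0; i < r->nmask; i++)
        count += popcount8[p[i] & r->mask[i]];
    *pp = p; /* leave the pointer at this run's first byte */
    return count;
}
```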

-josh

Jul 22 '05 #11
