468,457 Members | 1,718 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,457 developers. It's quick & easy.

Find all occurences of a bit sequence in a file

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

Feb 8 '07 #1
8 4840
Hello Daneel,

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Don't know if you can use this method,

Remi J.

Daneel a écrit :
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
Feb 8 '07 #2
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

All the best,
Michael

Feb 8 '07 #3
Daneel a écrit :
>An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?
Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation
Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Sincerely,

Remi J.
All the best,
Michael
Feb 8 '07 #4
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content
Oh, I see. How would you translate a binary file into such a string?

Thanks,
Michael

Feb 8 '07 #5
Daneel wrote:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
See, e.g., http://en.wikipedia.org/wiki/Knuth-M...ratt_algorithm

You'll have better luck posting algorithm questions on comp.programming

-Mark
Feb 8 '07 #6
KoopaJah wrote:
Hello Daneel,

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or the group FAQ list:
<http://www.parashift.com/c++-faq-lite/how-to-post.html>
Feb 8 '07 #7
On Feb 8, 9:17 am, "Daneel" <michael.ransb...@gmail.comwrote:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

Feb 8 '07 #8
On Feb 8, 4:12 pm, "W Karas" <wka...@yahoo.comwrote:
On Feb 8, 9:17 am, "Daneel" <michael.ransb...@gmail.comwrote:
Hello!
I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).
I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?
It would be great to get some input, possibly with sample source code.
Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.
Another method to do this would to be to make 8 copies of your search
pattern. Making each copy shifted by one bit. (Note, after 8 shifts
you have what you started with + 1 byte of zeros.)

Then perform a regular NxM comparison of all 8 copies on the input
data paying special attention to the last byte of the pattern. (First
shited copy you're going to want to only match 7 of 8 bits, second
shifted copy 6 of 8) and so on.

Then its just adding up the total matches for each shifted pattern.
The benefit you get from this are that you can apply any of the string
matching algorithms with a slight modification for the last byte match
(There are string matching algorithms that are sub-linear).

Hth,
Paul Davis

Feb 9 '07 #9

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

2 posts views Thread by John Jørgensen | last post: by
108 posts views Thread by Bryan Olson | last post: by
4 posts views Thread by Jason Gleason | last post: by
4 posts views Thread by Dameon | last post: by
reply views Thread by NPC403 | last post: by
1 post views Thread by subhajit12345 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.