473,224 Members | 1,602 Online

# Find all occurences of a bit sequence in a file

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

Feb 8 '07 #1
8 5564
Hello Daneel,

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Don't know if you can use this method,

Remi J.

Daneel a écrit :
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
Feb 8 '07 #2
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

All the best,
Michael

Feb 8 '07 #3
Daneel a écrit :
>An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?
Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation
Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Sincerely,

Remi J.
All the best,
Michael
Feb 8 '07 #4
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content
Oh, I see. How would you translate a binary file into such a string?

Thanks,
Michael

Feb 8 '07 #5
Daneel wrote:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
See, e.g., http://en.wikipedia.org/wiki/Knuth-M...ratt_algorithm

You'll have better luck posting algorithm questions on comp.programming

-Mark
Feb 8 '07 #6
KoopaJah wrote:
Hello Daneel,

with properly trimmed quotes. See the majority of other posts in the
newsgroup, or the group FAQ list:
<http://www.parashift.com/c++-faq-lite/how-to-post.html>
Feb 8 '07 #7
On Feb 8, 9:17 am, "Daneel" <michael.ransb...@gmail.comwrote:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

Feb 8 '07 #8
On Feb 8, 4:12 pm, "W Karas" <wka...@yahoo.comwrote:
On Feb 8, 9:17 am, "Daneel" <michael.ransb...@gmail.comwrote:
Hello!
I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).
I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?
It would be great to get some input, possibly with sample source code.
Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.
Another method to do this would to be to make 8 copies of your search
pattern. Making each copy shifted by one bit. (Note, after 8 shifts
you have what you started with + 1 byte of zeros.)

Then perform a regular NxM comparison of all 8 copies on the input
data paying special attention to the last byte of the pattern. (First
shited copy you're going to want to only match 7 of 8 bits, second
shifted copy 6 of 8) and so on.

Then its just adding up the total matches for each shifted pattern.
The benefit you get from this are that you can apply any of the string
matching algorithms with a slight modification for the last byte match
(There are string matching algorithms that are sub-linear).

Hth,
Paul Davis

Feb 9 '07 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

### Similar topics

 2 by: John Jørgensen | last post by: Hi How do I express - in XSD - that an element can contain a sequenced list of elements, and one of these elements may occur (0-n times) at BOTH sequence position x AND y? Example: ... 5 by: Larry | last post by: Dear all, I'm new to Python. I have a file (an image file actually) that I need to read pixel by pixel. It's an 8-bit integer type. I need to get the statistics like mean, standard deviation,... 0 by: veera ravala | last post by: ServiceNow is a powerful cloud-based platform that offers a wide range of services to help organizations manage their workflows, operations, and IT services more efficiently. At its core, ServiceNow... 0 by: VivesProcSPL | last post by: Obviously, one of the original purposes of SQL is to make data query processing easy. The language uses many English-like terms and syntax in an effort to make it easy to learn, particularly for... 0 by: mar23 | last post by: Here's the situation. I have a form called frmDiceInventory with subform called subfrmDice. The subform's control source is linked to a query called qryDiceInventory. I've been trying to pick up the... 0 by: abbasky | last post by: ### Vandf component communication method one: data sharing ​ Vandf components can achieve data exchange through data sharing, state sharing, events, and other methods. Vandf's data exchange method... 2 by: isladogs | last post by: The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE... 0 by: fareedcanada | last post by: Hello I am trying to split number on their count. suppose i have 121314151617 (12cnt) then number should be split like 12,13,14,15,16,17 and if 11314151617 (11cnt) then should be split like... 0 by: stefan129 | last post by: Hey forum members, I'm exploring options for SSL certificates for multiple domains. Has anyone had experience with multi-domain SSL certificates? Any recommendations on reliable providers or specific... 0 by: egorbl4 | last post by: Скачал я git, хотел начать настройку, а там вылезло вот это Что это? Что мне с этим делать? ... 0 by: MeoLessi9 | last post by: I have VirtualBox installed on Windows 11 and now I would like to install Kali on a virtual machine. However, on the official website, I see two options: "Installer images" and "Virtual machines"....