473,666 Members | 2,480 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Find all occurences of a bit sequence in a file

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

Feb 8 '07 #1
8 5636
Hello Daneel,

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Don't know if you can use this method,

Remi J.

Daneel a écrit :
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
Feb 8 '07 #2
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

All the best,
Michael

Feb 8 '07 #3
Daneel a écrit :
>An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?
Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 000100110011010 0, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation
Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Sincerely,

Remi J.
All the best,
Michael
Feb 8 '07 #4
An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 000100110011010 0, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content
Oh, I see. How would you translate a binary file into such a string?

Thanks,
Michael

Feb 8 '07 #5
Daneel wrote:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
See, e.g., http://en.wikipedia.org/wiki/Knuth-M...ratt_algorithm

You'll have better luck posting algorithm questions on comp.programmin g

-Mark
Feb 8 '07 #6
KoopaJah wrote:
Hello Daneel,

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or the group FAQ list:
<http://www.parashift.c om/c++-faq-lite/how-to-post.html>
Feb 8 '07 #7
On Feb 8, 9:17 am, "Daneel" <michael.ransb. ..@gmail.comwro te:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

Feb 8 '07 #8
On Feb 8, 4:12 pm, "W Karas" <wka...@yahoo.c omwrote:
On Feb 8, 9:17 am, "Daneel" <michael.ransb. ..@gmail.comwro te:
Hello!
I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).
I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?
It would be great to get some input, possibly with sample source code.
Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.
Another method to do this would to be to make 8 copies of your search
pattern. Making each copy shifted by one bit. (Note, after 8 shifts
you have what you started with + 1 byte of zeros.)

Then perform a regular NxM comparison of all 8 copies on the input
data paying special attention to the last byte of the pattern. (First
shited copy you're going to want to only match 7 of 8 bits, second
shifted copy 6 of 8) and so on.

Then its just adding up the total matches for each shifted pattern.
The benefit you get from this are that you can apply any of the string
matching algorithms with a slight modification for the last byte match
(There are string matching algorithms that are sub-linear).

Hth,
Paul Davis

Feb 9 '07 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
4564
by: John Jørgensen | last post by:
Hi How do I express - in XSD - that an element can contain a sequenced list of elements, and one of these elements may occur (0-n times) at BOTH sequence position x AND y? Example: <PRIORITY sorting="900"/> <NAMEINFO> subtags and data
108
6369
by: Bryan Olson | last post by:
The Python slice type has one method 'indices', and reportedly: This method takes a single integer argument /length/ and computes information about the extended slice that the slice object would describe if applied to a sequence of length items. It returns a tuple of three integers; respectively these are the /start/ and /stop/ indices and the /step/ or stride length of the slice. Missing or out-of-bounds indices are handled in a manner...
4
13637
by: Jason Gleason | last post by:
What's the most efficient way to get the number of occurences of a certain string in another string..for instance i'm using the following code right now... private int CharacterCounter(String text,String Character) { int count = 0;
4
3370
by: Dameon | last post by:
Hi All, I have a process where I'd like to search the contents of a file(in a dir) for all occurences (or the count of) of a given string. My goal is to focus more on performance, as some of the files could be upwards of 25mb in size and time is important. I don't want to take the route of loading the text of the file into a giant string and searching it, but would rather focus on a performance-minded solution. Any sugesstions for a...
1
3026
davydany
by: davydany | last post by:
Hey guys...a n00b Here for this site. I'm making a sequence class for my C++ class. And The thing is in the array that I have, lets say i put in {13,17,38,18}, when i see the current values for the array data from 0 to3, I get this {13, JUNK VALUE, 17,38, 18} and JUNK VALUE is like 1.8e831 or something like that. This happens when I use the attach() function and use the current() function to display the values at data I really want to...
2
1960
by: kasala | last post by:
I get an xml document as input from other department. The input xml document i recieve has a particular word "rnx" which should not be there and my system doesn't support it. And there is also attribute xml:lang="EN" in some elements which should not be there. I am enclosing the code for reference. I used to remove the occurences manually...but now I need an assembly which can do that for me in C#.I have to remove the occurences of the word...
18
3865
by: Neehar | last post by:
Hello For one of the interviews I took recently, I was given an offline programming quiz. In 30 minutes I had to write code in C++ to counts the number of times each unique word appears in a given file. I tried my level best even after the quiz to come up with a solution but cudnt find an efficient one. :( This is what I did.
0
1569
by: Rajgodfather | last post by:
I am getting the end couldn't error while validating xml file against xsd. The xml file looks perfect. XML: <event id="1"> <!-- Successful event. --> <AffectedBillingComponentsRequest
5
6924
by: Larry | last post by:
Dear all, I'm new to Python. I have a file (an image file actually) that I need to read pixel by pixel. It's an 8-bit integer type. I need to get the statistics like mean, standard deviation, etc., which I know a little bit already from reading numpy module. What I want to know is how to get the number of occurences of numeric element in an array. Say, b = array(()
0
8362
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
8785
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
8644
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7389
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6200
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
4372
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2776
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
2012
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
2
1778
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.