By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
464,668 Members | 1,377 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 464,668 IT Pros & Developers. It's quick & easy.

Bit substring search

P: n/a
I am trying to parse a bit-stream file format (bzip2) that does not have
byte-aligned record boundaries, so I need to do efficient matching of
bit substrings at arbitrary bit offsets.

Is there a package that can do this? This one comes close:

http://ilan.schnell-web.net/prog/bitarray/

but it only supports single bit substring match.

Kris
Jun 27 '08 #1
Share this Question
Share on Google+
5 Replies

P: n/a
Kris Kennaway:
I am trying to parse a bit-stream file format (bzip2) that does not have
byte-aligned record boundaries, so I need to do efficient matching of
bit substrings at arbitrary bit offsets.
Is there a package that can do this?
You may take a look at Hachoir or some other modules:
http://hachoir.org/wiki/hachoir-core
http://pypi.python.org/pypi/construct/2.00
http://pypi.python.org/pypi/FmtRW/20040603
Etc. More:
http://pypi.python.org/pypi?%3Aactio...ch&term=binary

Bye,
bearophile
Jun 27 '08 #2

P: n/a
be************@lycos.com wrote:
Kris Kennaway:
>I am trying to parse a bit-stream file format (bzip2) that does not have
byte-aligned record boundaries, so I need to do efficient matching of
bit substrings at arbitrary bit offsets.
Is there a package that can do this?

You may take a look at Hachoir or some other modules:
http://hachoir.org/wiki/hachoir-core
http://pypi.python.org/pypi/construct/2.00
Thanks. hachoir also comes close, but it also doesnt seem to be able to
match substrings at a bit level (e.g. the included bzip2 parser just
reads the header and hands the entire file off to libbzip2 to extract
data from).

construct exports a bit stream but it's again pure python and matching
substrings will be slow. It will need C support to do that efficiently.
http://pypi.python.org/pypi/FmtRW/20040603
Etc. More:
http://pypi.python.org/pypi?%3Aactio...ch&term=binary
Unfortunately I didnt find anything else useful here yet :(

Kris

Jun 27 '08 #3

P: n/a
Kris Kennaway:
Unfortunately I didnt find anything else useful here yet :(
I see, I'm sorry, I have found hachoir quite nice in the past. Maybe
there's no really efficient way to do it with Python, but you can
create a compiled extension, so you can see if it's fast enough for
your purposes.
To create such extension you can:
- One thing that requires very little time is to create an extension
with ShedSkin, once installed it just needs Python code.
- Cython (ex-Pyrex) too may be okay, but it's a bit trikier on Windows
machines.
- Using Pyd to create a D extension for Python is often the faster way
I have found to create extensions. I need just few minutes to create
them this way. But you need to know a bit of D.
- Then, if you want you can write a C extension, but if you have not
done it before you may need some hours to make it work.

Bye,
bearophile
Jun 27 '08 #4

P: n/a
be************@lycos.com wrote:
Kris Kennaway:
>Unfortunately I didnt find anything else useful here yet :(

I see, I'm sorry, I have found hachoir quite nice in the past. Maybe
there's no really efficient way to do it with Python, but you can
create a compiled extension, so you can see if it's fast enough for
your purposes.
To create such extension you can:
- One thing that requires very little time is to create an extension
with ShedSkin, once installed it just needs Python code.
- Cython (ex-Pyrex) too may be okay, but it's a bit trikier on Windows
machines.
- Using Pyd to create a D extension for Python is often the faster way
I have found to create extensions. I need just few minutes to create
them this way. But you need to know a bit of D.
- Then, if you want you can write a C extension, but if you have not
done it before you may need some hours to make it work.
Thanks for the pointers, I think a C extension will end up being the way
to go, unless someone has beaten me to it and I just haven't found it yet.

Kris
Jun 27 '08 #5

P: n/a
Scott David Daniels wrote:
Kris Kennaway wrote:
>Thanks for the pointers, I think a C extension will end up being the
way to go, unless someone has beaten me to it and I just haven't found
it yet.

Depending on the pattern length you are targeting, it may be fastest to
increase the out-of-loop work. For a 40-bit string, build an 8-target
Aho-Corasick machine, and at each match check the endpoints. This will
only work well if 40 bits is at the low end of what you are hunting for.
Thanks, I wasn't aware of Aho-Corasick.

Kris

Jun 27 '08 #6

This discussion thread is closed

Replies have been disabled for this discussion.