473,756 Members | 8,034 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Scanning a file

I want to scan a file byte for byte for occurences of the the four byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
ch = inputFile.read( 1)
numChars += 1

if len(ch) < 1: break

startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
if numChars < 4: continue

if startCode == 0x00000100:
count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Oct 28 '05
79 5276
"Paul Watson" <pw*****@redlin epy.com> writes:
Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap; especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = '\x00\x00\x01\x 00'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksi ze)
count = 0
while len(b) > be:
count += b.count(ss)
b = b[-be:] + fp.read(blocksi ze)
fp.close()

print count
sys.exit(0)


Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.

Thanks,
<mike
--
Mike Meyer <mw*@mired.or g> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 29 '05 #21
Mike Meyer wrote:
Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.


if you use "mmap" to read large files sequentially, without calling "madvise",
the system may try to keep old pages around just in case, and won't do as
much read-ahead as it can do. if you're low on memory, that means that
the system may waste some time swapping out data for other applications,
rather than throwing away data that you know that you will never look at
again.

if you have reasonably large files and you're not running on an overcrowded
machine, this is usually not a problem.

(does Python's mmap module support madvise, btw? doesn't look like it does...)

</F>

Oct 29 '05 #22
On 28 Oct 2005 06:51:36 -0700, "pi************ @gmail.com" <pi************ @gmail.com> wrote:
First of all, this isn't a text file, it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010 the substring 0010
occurs twice.

ISTM you better let others know exactly what you mean by this, before
you use the various solutions suggested or your own ;-)

a) Are you talking about bit strings or byte strings?
b) Do you want to _count_ overlapping substrings??!!
Note result of s.count on your example:
s = '0010010'
s.count('0010') 1

vs. brute force counting overlapped substrings (not tested beyond what you see ;-)
def ovcount(s, sub): ... start = count = 0
... while True:
... start = s.find(sub, start) + 1
... if start==0: break
... count += 1
... return count
... ovcount(s, '0010')

2

Regards,
Bengt Richter
Oct 29 '05 #23
On Fri, 28 Oct 2005 20:03:17 -0700, al*****@yahoo.c om (Alex Martelli) wrote:
Mike Meyer <mw*@mired.or g> wrote:
...
Except if you can't read the file into memory because it's to large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in in
chunks, count the occurences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.


That's the kind of things generators are for...:

def byblocks(f, blocksize, overlap):
block = f.read(blocksiz e)
yield block
while block:
block = block[-overlap:] + f.read(blocksiz e-overlap)
if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
count += block.count(sub st)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...

Do I get a job at google if I find something wrong with the above? ;-)

Regards,
Bengt Richter
Oct 29 '05 #24
Bengt Richter wrote:
On Fri, 28 Oct 2005 20:03:17 -0700, al*****@yahoo.c om (Alex Martelli)
wrote:
Mike Meyer <mw*@mired.or g> wrote:
...
Except if you can't read the file into memory because it's to large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in in
chunks, count the occurences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.


That's the kind of things generators are for...:

def byblocks(f, blocksize, overlap):
block = f.read(blocksiz e)
yield block
while block:
block = block[-overlap:] + f.read(blocksiz e-overlap)
if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
count += block.count(sub st)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...

Do I get a job at google if I find something wrong with the above? ;-)


Try it with a subst of length 1. Seems like you missed an opportunity :-)

Peter

Oct 29 '05 #25
On Sat, 29 Oct 2005 10:34:24 +0200, Peter Otten <__*******@web. de> wrote:
Bengt Richter wrote:
On Fri, 28 Oct 2005 20:03:17 -0700, al*****@yahoo.c om (Alex Martelli)
wrote:
Mike Meyer <mw*@mired.or g> wrote:
...
Except if you can't read the file into memory because it's to large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in in
chunks, count the occurences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.

That's the kind of things generators are for...:

def byblocks(f, blocksize, overlap):
block = f.read(blocksiz e)
yield block
while block:
block = block[-overlap:] + f.read(blocksiz e-overlap)
if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
count += block.count(sub st)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...

Do I get a job at google if I find something wrong with the above? ;-)


Try it with a subst of length 1. Seems like you missed an opportunity :-)

I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was
gen = byblocks(String IO.StringIO('no '),1024,len('en d?')-1)
[gen.next() for i in xrange(10)]

['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Regards,
Bengt Richter
Oct 29 '05 #26
Bengt Richter <bo**@oz.net> wrote:
...
while block:
block = block[-overlap:] + f.read(blocksiz e-overlap)
if block: yield block
...
I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was
>>> gen = byblocks(String IO.StringIO('no '),1024,len('en d?')-1)
>>> [gen.next() for i in xrange(10)]

['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']


Heh, OK, I should get back into the habit of adding a "warning: untested
code" when I post code (particularly when it's late and I'm
jetlagged;-). The code I posted will never exit, since block always
keeps the last overlap bytes; it needs to be changed into something like
(warning -- untested code!-)

if overlap>0:
while True:
next = f.read(blocksiz e-overlap)
if not next: break
block = block[-overlap:] + next
yield block
else:
while True:
next = f.read(blocksiz e)
if not next: break
yield next

(the if/else is needed to handle requests for overlaps <= 0, if desired;
I think it's clearer to split the cases rather than to test inside the
loop's body).
Alex

Oct 29 '05 #27
Bengt Richter wrote:
What struck me was
*gen*=*byblo cks(StringIO.St ringIO('no'),10 24,len('end?')-1)
*[gen.next()*for* i*in*xrange(10)]

['no',*'no',*'no ',*'no',*'no',* 'no',*'no',*'no ',*'no',*'no']


Ouch. Seems like I spotted the subtle cornercase error and missed the big
one.

Peter

Oct 29 '05 #28
"Mike Meyer" <mw*@mired.or g> wrote in message
news:86******** ****@bhuda.mire d.org...
"Paul Watson" <pw*****@redlin epy.com> writes: .... Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.


The only thing copied is a string one byte less than the search string for
each block.

I did not do due dilligence with respect to timings. Here is a small
dataset read sequentially and using mmap.

$ ls -lgG t.dat
-rw-r--r-- 1 16777216 Oct 28 16:32 t.dat
$ time ./scanfile.py
1048576
0.80s real 0.64s user 0.15s system
$ time ./scanfilemmap.py
1048576
20.33s real 6.09s user 14.24s system

With a larger file, the system time skyrockets. I assume that to be the
paging mechanism in the OS. This is Cyngwin on Windows XP.

$ ls -lgG t2.dat
-rw-r--r-- 1 268435456 Oct 28 16:33 t2.dat
$ time ./scanfile.py
16777216
28.85s real 16.37s user 0.93s system
$ time ./scanfilemmap.py
16777216
323.45s real 94.45s user 227.74s system
Oct 29 '05 #29
I think implementing a finite state automaton would be a good (best?)
solution. I have drawn a FSM for you (try viewing the following in
fixed width font). Just increment the count when you reach state 5.

<---------------|
| |
0 0 | 1 0 |0
-->[1]--->[2]--->[3]--->[4]--->[5]-|
^ | | ^ | | |
1| |<---| | | |1 |1
|_| 1 |_| | |
^ 0 | |
|---------------------|<-----|

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

Here you don't have special cases whether reading in blocks or reading
whole at once (as you only need one byte at a time).

Vaibhav

Oct 29 '05 #30

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

21
12212
by: CHANGE username to westes | last post by:
What are the most popular, and well supported, libraries of drivers for bar code scanners that include a Visual Basic and C/C++ API? My requirements are: - Must allow an application to be written to a single interface, but support many different manufacturers' barcode scanning devices. I do not want to be tied to one manufacturers' software interfaces. - Must support use of the scanner from Visual Basic, and ideally from C/C++ and...
4
3239
by: Zen | last post by:
I'm using Access 2000, and I'd like to know if there is a way to use a scanner (flatbed, doc-feed, etc) to scan forms with OMR or OCR software, and have the data be automatically (or if not automatically then using a macro or other means) entered into tables. I guess the real question is do I need to use an expensive program to do this or is it codable suing Access/VB, and if it is codable, any suggestions as to how to start? Many...
8
2450
by: Marie-Christine Bechara | last post by:
I have a form with a button called btnScan. When i click on this button i want to scan a file and save it in the database. Any hints?? ideas??? solutions??? *** Sent via Developersdex http://www.developersdex.com *** Don't just participate in USENET...get rewarded for it!
3
1305
by: Brent Burkart | last post by:
I am using a streamreader to read a log file into my application. Now I want to be able to scan for error messages, such as "failed", "error", "permission denied", so I can take action such as send an email. I am not quite sure how to approach this as far as scanning the content. I currently read all of the contents in using the following Dim contents As String = objStreamReader.ReadToEnd()
6
3875
by: Bob Alston | last post by:
I am looking for others who have built systems to scan documents, index them and then make them accessible from an Access database. My environment is a nonprofit with about 20-25 case workers who use laptops. They have Access databases on their laptops and the data is replicated. The idea is that each case worker would scan their own documents, either remotely or back at the office. And NO I am not planning to store the scanned...
4
1376
by: tshad | last post by:
We have a few pages that accept uploads and want to scan the files before accepting them. Does Asp.net have a way of doing a virus scan? We are using Trendmicro to scan files and email but don't know if we can use it with our pages to handle files that our clients upload. Is there some type of API that would allow us to do this? I want to be able to Upload Word files using: <input id="MyFile" visible="true" style="width:200px"...
1
1538
kirubagari
by: kirubagari | last post by:
For i = 49 To mfilesize Step 6 rich1.SelStart = Len(rich1.Text) rich1.SelText = "Before : " & HexByte2Char(arrByte(i)) & _ " " & HexByte2Char(arrByte(i + 1)) & " " _ & HexByte2Char(arrByte(i + 2)) & " " _ & HexByte2Char(arrByte(i + 3)) & " " _ & HexByte2Char(arrByte(i + 4)) & " " _
23
3751
by: Rotsey | last post by:
Hi, I am writing an app that scans hard drives and logs info about every fine on the drive. The first iteration of my code used a class and a generic list to store the data and rhis took 13min on my 60 GB drive. I wanted it to be quicker.
2
4395
by: iheartvba | last post by:
Hi Guys, I have been using EzTwain Pro to scan documents into my access program. It allows me to specify the location I want the Doc to go to. It also allows me to set the name of the document as well. The link to the program is as below : EZTwain imaging library system - add TWAIN scanning or image capture to your application. I'm not sure if it's the nature of the program, but the scanning module is very slow to load. Otherwise it's...
0
9456
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10040
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
9873
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9846
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9713
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8713
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7248
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6534
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5304
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.