Scanning a file

I want to scan a file byte for byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Oct 28 '05
Peter Otten wrote:
Bengt Richter wrote:

What struck me was

> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
> [gen.next() for i in xrange(10)]


['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Ouch. Seems like I spotted the subtle cornercase error and missed the big
one.


No, you just realised subconsciously that we'd all spot the obvious one
and decided to point out the bug that would remain after the obvious one
had been fixed.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Oct 29 '05 #31
Steven D'Aprano <st***@REMOVETHIScyber.com.au> wrote:

On Fri, 28 Oct 2005 15:29:46 +0200, Björn Lindström wrote:
"pi************@gmail.com" <pi************@gmail.com> writes:
f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count


That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")


Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done.


Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Oct 29 '05 #32
Paul Watson wrote:
Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap; especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

<ss = pattern, be = len(ss) - 1>
...
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
...

In cases where that one wins and blocksize is big,
this should do even better:
...
block = fp.read(blocksize)
count = 0
while len(block) > be:
    count += block.count(ss)
    lead = block[-be:]
    block = fp.read(blocksize)
    count += (lead + block[:be]).count(ss)
...
--
-Scott David Daniels
sc***********@acm.org
Oct 29 '05 #33
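For anyone wanting to try the block-reading idea today, here is a minimal sketch in Python 3 (bytes instead of the Python 2 strings used in the thread). It follows the shape of the first loop above: keep the last `len(pattern) - 1` bytes of each buffer and prepend them to the next read. The names `count_overlapping` and `count_in_stream` and the default block size are my own choices, not anything from the posts. One deliberate difference: it counts overlapping matches with a `find` loop, as the original byte-by-byte scan does, because `count()` only counts non-overlapping ones.

```python
import io

PATTERN = b"\x00\x00\x01\x00"


def count_overlapping(buf, pattern):
    """Count every occurrence of pattern in buf, including overlapping ones."""
    count = 0
    pos = buf.find(pattern)
    while pos != -1:
        count += 1
        # Resume one byte later so overlapping matches are not skipped.
        pos = buf.find(pattern, pos + 1)
    return count


def count_in_stream(fp, pattern=PATTERN, blocksize=1 << 16):
    """Count occurrences of pattern in a binary stream, reading block by block."""
    overlap = len(pattern) - 1
    tail = b""
    total = 0
    while True:
        block = fp.read(blocksize)
        if not block:
            break
        # Prepend the carried-over bytes so matches straddling two reads are seen.
        buf = tail + block
        total += count_overlapping(buf, pattern)
        # The tail is shorter than the pattern, so no match is counted twice.
        tail = buf[-overlap:] if overlap else b""
    return total


if __name__ == "__main__":
    sample = b"\x00\x00\x01\x00\x00\x01\x00"
    print(count_in_stream(io.BytesIO(sample), blocksize=3))
```

The distinction matters for this particular pattern because it starts and ends with a zero byte: in `b"\x00\x00\x01\x00\x00\x01\x00"` the pattern occurs at offsets 0 and 3, sharing one byte, and `count()` reports only one of them.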
Tim Roberts <ti**@probo.com> wrote:
...
print file("filename", "rb").read().count("\x00\x00\x01\x00")


Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done.


Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.


In today's implementations of Classic Python, yes. In other equally
valid implementations of the language, such as Jython, IronPython, or,
for all we know, some future implementation of Classic, that may well
not be the case. Many, quite reasonably, dislike relying on a specific
implementation's peculiarities, and prefer to write code that relies
only on what the _language_ specs guarantee.
Alex
Oct 29 '05 #34
ne********@gmail.com wrote:
I think implementing a finite state automaton would be a good (best?)
solution. I have drawn a FSM for you (try viewing the following in
fixed width font). Just increment the count when you reach state 5.

<---------------|
| |
0 0 | 1 0 |0
-->[1]--->[2]--->[3]--->[4]--->[5]-|
^ | | ^ | | |
1| |<---| | | |1 |1
|_| 1 |_| | |
^ 0 | |
|---------------------|<-----|

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

Here you don't have special cases whether reading in blocks or reading
whole at once (as you only need one byte at a time).

Indeed, but reading one byte at a time is about the slowest way to
process a file, in Python or any other language, because it fails to
amortize the overhead cost of function calls over many characters.

Buffering wasn't invented because early programmers had nothing better
to occupy their minds, remember :-)

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Oct 29 '05 #35
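The state-machine idea can be sketched as below. This is not a transcription of the diagram above (its layout did not survive formatting); it is a generic KMP-style automaton built from a failure table, which is one standard way to get the same behaviour. The function names are hypothetical, and the code is Python 3, where iterating over `bytes` yields integers. It still visits one byte per loop iteration, so the point about per-byte overhead applies to it in full.

```python
def build_failure(pattern):
    """For each prefix of pattern, length of its longest proper prefix
    that is also a suffix (the classic KMP failure table)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def count_fsm(data, pattern):
    """Count occurrences of pattern in data (overlapping ones included)
    by feeding one byte at a time through the automaton."""
    fail = build_failure(pattern)
    state = 0  # number of pattern bytes matched so far
    count = 0
    for byte in data:
        # On a mismatch, fall back to the longest shorter partial match.
        while state and byte != pattern[state]:
            state = fail[state - 1]
        if byte == pattern[state]:
            state += 1
        if state == len(pattern):
            count += 1
            # Keep whatever suffix of the match can start the next one.
            state = fail[state - 1]
    return count


if __name__ == "__main__":
    print(count_fsm(b"\x00\x00\x00\x01\x00", b"\x00\x00\x01\x00"))
```

Because the automaton only needs its current state between bytes, it can be fed from blocks of any size without special handling at block boundaries, which is the property the FSM post was pointing at.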
"Alex Martelli" <al*****@yahoo.com> wrote in message
news:1h5760l.1e2eatkurdeo7N%al*****@yahoo.com...
In today's implementations of Classic Python, yes. In other equally
valid implementations of the language, such as Jython, IronPython, or,
for all we know, some future implementation of Classic, that may well
not be the case. Many, quite reasonably, dislike relying on a specific
implementation's peculiarities, and prefer to write code that relies
only on what the _language_ specs guarantee.


How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations and future implementations which may
not have this provided.
Oct 29 '05 #36
"Paul Watson" <pw*****@redlinepy.com> writes:
How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations and future implementations which may
not have this provided.


There is nothing in the Python language reference that guarantees the
files will be closed when the references go out of scope. That
CPython does it is simply an implementation artifact. If you want to
make sure they get closed, you have to close them explicitly. There
are some Python language extensions in the works to make this more
convenient (PEP 343) but for now you have to do it by hand.
Oct 29 '05 #37
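For later readers: PEP 343 became the `with` statement, so the explicit close no longer has to be written by hand. A minimal sketch in Python 3 follows; `count_pattern` and `demo` are names made up for this illustration, and `demo` only exists to create a throwaway file to run against.

```python
import os
import tempfile

PATTERN = b"\x00\x00\x01\x00"


def count_pattern(path, pattern=PATTERN):
    # The with block closes the file when it exits, even if read() raises,
    # so nothing depends on when (or whether) the object is collected.
    with open(path, "rb") as f:
        return f.read().count(pattern)


def demo():
    # Write a small temporary file, count the pattern in it, then clean up.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"xx" + PATTERN + b"yy" + PATTERN)
        return count_pattern(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

This keeps the one-expression read-and-count that was fastest in the thread while making the close explicit in the language rather than in the implementation.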
ne********@gmail.com wrote:
I think implementing a finite state automaton would be a good (best?)
solution. I have drawn a FSM for you (try viewing the following in
fixed width font). Just increment the count when you reach state 5.

<---------------|
| |
0 0 | 1 0 |0
-->[1]--->[2]--->[3]--->[4]--->[5]-|
^ | | ^ | | |
1| |<---| | | |1 |1
|_| 1 |_| | |
^ 0 | |
|---------------------|<-----|

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)


I already have that book. The above solution is very slow in practice. None
of the solutions presented in this thread is nearly as fast as the

print file("filename", "rb").read().count("\x00\x00\x01\x00")

/David
Oct 29 '05 #38
On Sat, 29 Oct 2005 21:08:09 +0000, Tim Roberts wrote:
In any case, you are assuming that Python will automagically close the
file when you are done.


Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.


That is an implementation detail. CPython may do that, but JPython does
not -- or at least it did not last time I looked. JPython doesn't
guarantee that the file will be closed at any particular time, just that
it will be closed eventually.

If all goes well. What if you have a circular dependence and the file
reference never gets garbage-collected? What happens if the JPython
process gets killed before the file is closed? You might not care about
one file not being released, but what if it is hundreds of files?

In general, it is best practice to release external resources as soon as
you're done with them, and not rely on a garbage collector which may or
may not release them in a timely manner.

There are circumstances where things do not go well and the file never
gets closed cleanly -- for example, when your disk is full, and the
buffer is only written to disk when you close the file. Would you
prefer that error to raise an exception, or to pass silently? If you want
close exceptions to pass silently, then by all means rely on the garbage
collector to close the file.

You might not care about these details in a short script -- when I'm
writing a use-once-and-throw-away script, that's what I do. But it isn't
best practice: explicit is better than implicit.

I should also point out that for really serious work, the idiom:

f = file("parrot")
handle(f)
f.close()

is insufficiently robust for production level code. That was a detail I
didn't think I needed to drop on the original newbie poster, but depending
on how paranoid you are, or how many exceptions you want to insulate the
user from, something like this might be needed:

try:
    f = file("parrot")
    try:
        handle(f)
    finally:
        try:
            f.close()
        except:
            print "The file could not be closed; see your sys admin."
except:
    print "The file could not be opened."
--
Steven.

Oct 29 '05 #39
"Paul Watson" <pw*****@redlinepy.com> writes:
"Mike Meyer" <mw*@mired.org> wrote in message
news:86************@bhuda.mired.org...
"Paul Watson" <pw*****@redlinepy.com> writes:

...
Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.


The only thing copied is a string one byte less than the search string for
each block.


Um - you removed the code, but I could have *sworn* that it did
something like:

buf = buf[testlen:] + f.read(bufsize - testlen)

which should cause the creation of three strings: the last few
bytes of the old buffer, a new bufferful from the read, then the sum
of those two - created by copying the first two into a new string. So
you wind up copying all the data.

Which, as you showed, doesn't take nearly as much time as using mmap.

Thanks,
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 29 '05 #40
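Since the mmap variant being compared never appears in this part of the thread, here is a rough sketch of what such a version might look like in Python 3, so the two approaches can be timed side by side on your own files. The names `count_with_mmap` and `demo_mmap` are assumptions of mine, not code from any poster, and no timing result is implied. It counts non-overlapping matches, the same way `count()` does.

```python
import mmap
import os
import tempfile

PATTERN = b"\x00\x00\x01\x00"


def count_with_mmap(path, pattern=PATTERN):
    """Count non-overlapping occurrences of pattern using a read-only mapping."""
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be memory-mapped
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(pattern)
            while pos != -1:
                count += 1
                # Skip past this match, as count() does.
                pos = mm.find(pattern, pos + len(pattern))
            return count


def demo_mmap():
    # Create a throwaway file with two matches, count them, then clean up.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"xx" + PATTERN + b"yy" + PATTERN)
        return count_with_mmap(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_mmap())
```

The mapping avoids building intermediate strings for the overlap, which was the copying concern raised above; whether that outweighs paging costs depends on file size and platform, so measure rather than assume.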
