Bytes IT Community

python vs. grep

I've read a great paper about generators:
http://www.dabeaz.com/generators/index.html

The author says that it's easy to write analogues of common Linux tools
such as awk, grep, etc., and that performance can even be better.

But I'm having trouble writing a grep analogue with comparable
performance. This is my script:

import re
pat = re.compile("sometext")

f = open("bigfile", 'r')

flines = (line for line in f if pat.search(line))
c = 0
for x in flines:
    c += 1
print c

and bash:
grep "sometext" bigfile | wc -l

The Python code is 3-4 times slower on Windows, and as I remember it's
the same situation on Linux...

Increasing the buffering argument to open() makes it even slower.

Is it possible to improve the file-reading performance?
Jun 27 '08 #1
13 Replies


On Tue, May 6, 2008 at 1:42 PM, Anton Slesarev <sl************@gmail.com> wrote:
Is it possible to increase file reading performance?
Dunno about that, but this part:
flines = (line for line in f if pat.search(line))
c = 0
for x in flines:
    c += 1
print c
could be rewritten as just:

print sum(1 for line in f if pat.search(line))
Jun 27 '08 #2
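Wrapped up as a self-contained helper (the function name, file name, and pattern below are placeholders, not from the thread), that counting idiom looks like:

```python
import re

def count_matching_lines(path, pat):
    """Count lines in `path` matched by the compiled regex `pat`,
    streaming the file line by line."""
    with open(path) as f:
        return sum(1 for line in f if pat.search(line))

# Usage (placeholders): count_matching_lines("bigfile", re.compile("sometext"))
```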

Anton Slesarev <sl************@gmail.com> writes:
f = open("bigfile", 'r')

flines = (line for line in f if pat.search(line))
c = 0
for x in flines:
    c += 1
print c
It would be simpler (and probably faster) not to use a generator expression:

search = re.compile('sometext').search

c = 0
for line in open('bigfile'):
    if search(line):
        c += 1

Perhaps faster (because the number of name lookups is reduced), using
itertools.ifilter:

from itertools import ifilter

c = 0
for line in ifilter(search, open('bigfile')):
    c += 1
If 'sometext' is just text (no regexp wildcards) then even simpler:

....
for line in ...:
    if 'sometext' in line:
        c += 1

I don't believe you'll easily beat grep + wc using Python though.

Perhaps faster?

sum(bool(search(line)) for line in open('bigfile'))
sum(1 for line in ifilter(search, open('bigfile')))

....etc...

All this is untested!
--
Arnaud
Jun 27 '08 #3
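The plain-substring variant suggested above, as a runnable sketch (the helper name and inputs are hypothetical). For a fixed string, `needle in line` skips the regex engine entirely, much like grep -F versus grep:

```python
def count_containing(path, needle):
    """Count lines of `path` that contain the literal
    substring `needle` (no regular expressions involved)."""
    count = 0
    with open(path) as f:
        for line in f:
            if needle in line:
                count += 1
    return count
```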

2008/5/6, Anton Slesarev <sl************@gmail.com>:
But I'm having trouble writing a grep analogue with comparable performance.
[...]
The Python code is 3-4 times slower on Windows, and as I remember it's
the same situation on Linux...

Increasing the buffering argument to open() makes it even slower.

Is it possible to improve the file-reading performance?
The best advice would be not to try to beat grep, but if you really
want to, this is the right place ;)

Here is my code:
$ cat grep.py
import sys

if len(sys.argv) != 3:
    print 'grep.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2], 'r')

print ''.join((line for line in f if sys.argv[1] in line)),

$ ls -lh debug.0
-rw-r----- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

---
$ time grep nusia debug.0 |wc -l
26009

real 0m0.042s
user 0m0.020s
sys 0m0.004s
---

---
$ time python grep.py nusia debug.0 |wc -l
26009

real 0m0.077s
user 0m0.044s
sys 0m0.016s
---

---
$ time grep nusia debug.0

real 0m3.163s
user 0m0.016s
sys 0m0.064s
---

---
$ time python grep.py nusia debug.0
[26009 lines here...]
real 0m2.628s
user 0m0.032s
sys 0m0.064s
---

So printing the results takes 2.6 s for Python and 3.1 s for the original
grep. Surprised? The only reason for this is that we have reduced the
number of write calls in the Python example:

$ strace -ooriggrep.log grep nusia debug.0
$ grep write origgrep.log |wc -l
26009
$ strace -opygrep.log python grep.py nusia debug.0
$ grep write pygrep.log |wc -l
12
Wish you luck saving your CPU cycles :)

--
Regards,
Wojtek Walczak
http://www.stud.umk.pl/~wojtekwa/
Jun 27 '08 #4
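The write-call observation above generalizes: batching all matching lines into one string and writing it once avoids per-line syscall overhead. A minimal sketch of that trick as a reusable function (the name and arguments are placeholders; it assumes a literal, non-regex pattern):

```python
import sys

def grep_once(path, needle, out=sys.stdout):
    """Print every line of `path` containing `needle`, issuing a
    single buffered write instead of one write per matching line."""
    with open(path) as f:
        out.write(''.join(line for line in f if needle in line))
```

As the strace counts show, collapsing 26009 write calls down to a handful is where most of the gap between the two timed runs comes from.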

On May 6, 10:42 pm, Anton Slesarev <slesarev.an...@gmail.com> wrote:
flines = (line for line in f if pat.search(line))
What about re.findall() / re.finditer() for the whole file contents?

Jun 27 '08 #5
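A sketch of that whole-contents approach (the helper name is hypothetical). Note that it counts matches rather than matching lines, so the two numbers can differ when a line matches more than once:

```python
import re

def count_matches_whole(path, pattern):
    """Count non-overlapping matches of `pattern` across the
    entire file contents, read into memory in one go."""
    with open(path) as f:
        data = f.read()
    return len(re.findall(pattern, data))
```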

Anton Slesarev wrote:
>
But I'm having trouble writing a grep analogue with comparable performance.
I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer; this
thread is relevant:

http://groups.google.com/group/comp....476da5d7a9e466

This relates to the speed of re. If you don't need regexes, don't use re.
If you do need re, an alternate regex library might be useful, but you
aren't going to catch grep.
Jun 27 '08 #6

On May 7, 7:22 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
Anton Slesarev wrote:
But I'm having trouble writing a grep analogue with comparable performance.

I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer; this
thread is relevant:

http://groups.google.com/group/comp....thread/thread/...

This relates to the speed of re. If you don't need regexes, don't use re.
If you do need re, an alternate regex library might be useful, but you
aren't going to catch grep.
In my last test I don't use re. As I understand it, the main problem is
in reading the file.
Jun 27 '08 #7

Anton Slesarev wrote:
I've read a great paper about generators:
http://www.dabeaz.com/generators/index.html
The author says that it's easy to write analogues of common Linux tools
such as awk, grep, etc., and that performance can even be better.
But I'm having trouble writing a grep analogue with comparable performance.

https://svn.enthought.com/svn/sandbox/grin/trunk/

hth,
Alan Isaac
Jun 27 '08 #8

Alan Isaac wrote:
Anton Slesarev wrote:
>I've read a great paper about generators:
http://www.dabeaz.com/generators/index.html The author says that it's
easy to write analogues of common Linux tools such as awk, grep, etc.,
and that performance can even be better. But I'm having trouble writing
a grep analogue with comparable performance.

https://svn.enthought.com/svn/sandbox/grin/trunk/
As the author of grin I can definitively state that it is not at all competitive
with grep in terms of speed. grep reads files really fast. awk is probably
beatable, though.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '08 #9

On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?
re.findall/ finditer, as I said earlier.
Jun 27 '08 #10

Ville Vainio wrote:
On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
>All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?

re.findall/ finditer, as I said earlier.
Hi, sorry it took so long to answer. Too much work.

findall/finditer do not address the issue, they merely find ALL the
matches in a STRING. But if you keep reading the files a line at a time
(as most examples given in this thread do) then you are STILL in trouble
when a regular expression spans multiple lines.
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too
heavy? So I was wondering if there is any other way out of it. Does grep
read the whole file into memory? Does it ONLY process a line at a time?

Jun 27 '08 #11

P: n/a
On Tue, 13 May 2008 00:03:08 +1000, Ricardo ArŠoz <ri******@gmail.com>
wrote:
Ville Vainio wrote:
>On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
>>All these examples assume your regular expression will not span
multiple lines, but this can easily be the case. How would you process
the file with regular expressions that span multiple lines?
re.findall/ finditer, as I said earlier.

Hi, sorry it took so long to answer. Too much work.

findall/finditer do not address the issue, they merely find ALL the
matches in a STRING. But if you keep reading the files a line at a time
(as most examples given in this thread do) then you are STILL in trouble
when a regular expression spans multiple lines.
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too
heavy? So I was wondering if there is any other way out of it. Does grep
read the whole file into memory? Does it ONLY process a line at a time?

Standard grep can only match a line at a time. Are you thinking about
"sed", which has a sliding window?

See http://www.gnu.org/software/sed/manual/sed.html, Section 4.13

--
Kam-Hung Soh <a href="http://kamhungsoh.com/blog">Software Salariman</a>

Jun 27 '08 #12
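A sliding-window scan can be sketched in Python too: read fixed-size chunks, keep an overlap at least as long as the longest possible match, and deduplicate matches by absolute offset so nothing in the overlap is counted twice. This is an illustrative sketch, not anything from the thread; the helper name, chunk size, and overlap are arbitrary placeholders:

```python
import re

def count_spanning(path, pattern, chunk_size=1 << 16, overlap=256):
    """Count matches of `pattern` that may cross line (and chunk)
    boundaries, by regex-scanning overlapping windows of the file.

    `overlap` must be >= the longest possible match, or a match
    straddling a window edge can be missed.
    """
    pat = re.compile(pattern, re.DOTALL)
    seen = set()   # absolute start offsets of matches already counted
    offset = 0     # absolute file offset of the current window's start
    tail = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            window = tail + chunk
            for m in pat.finditer(window):
                seen.add(offset + m.start())
            if not chunk:
                break
            # Carry the last `overlap` characters into the next window.
            tail = window[-overlap:]
            offset += len(window) - len(tail)
    return len(seen)
```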

Ricardo Aráoz <ri******@gmail.com> writes:
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too
The easiest and simplest approach is often the best with
Python. Reading in the whole file is rarely too heavy, and you avoid
the Python "object overhead" entirely - all the code executes in the
fast C extensions.

If the file is too big, you might want to look up mmap:

http://effbot.org/librarybook/mmap.htm
Jun 27 '08 #13
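A sketch of the mmap route (the helper name is hypothetical). The file is mapped rather than read, so the OS pages it in on demand; the pattern must be a bytes pattern, since the mapping yields bytes, and multi-line patterns work because the regex sees the whole file at once:

```python
import mmap
import re

def count_mmap(path, pattern):
    """Count matches of the bytes regex `pattern` by scanning a
    memory-mapped view of the file instead of reading it in."""
    pat = re.compile(pattern, re.DOTALL)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return sum(1 for _ in pat.finditer(mm))
        finally:
            mm.close()
```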

Ville M. Vainio wrote:
Ricardo Aráoz <ri******@gmail.com> writes:
>The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too

The easiest and simplest approach is often the best with
Python.
Keep forgetting that!
>
If the file is too big, you might want to look up mmap:

http://effbot.org/librarybook/mmap.htm
Thanks!
Jun 27 '08 #14
