Efficient grep using Python? - Page 2

Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.

Jul 18 '05

Subscribe Reply

11796

The point is that when you have 100,000s of records, this grep becomes
really slow?

Any comments?

Thats why I looked for python :)

that would be

grep -vf B A

and it is a rare use of grep, indeed.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...

Jul 18 '05 #11

Christos TZOTZIOY Georgiou

On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that P@draigBrady.co m
might have written:

[snip some damn lie aka "benchmark"]

[me]

(Yes, I cheated by adding the F (for no regular expressions) flag :)
Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!

Well, that's the closest I can do:

$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright" , "credits" or "license" for more information.
import sys; sys.ps1='.>>' ..>> alist=[line.strip() for line in open('/usr/share/dict/words')]
..>> words=set()
..>> for word in alist:
.... words.add(word + '\n')
.... words.add(word[::-1] + '\n')
....
..>> len(words)
90525
..>> words=list(word s)
..>> open('/tmp/A', 'w').writelines (words)
..>> import random; random.shuffle( words)
..>> open('/tmp/B', 'w').writelines (words[:90000])
..>>
$ time sort A B B | uniq -u >/dev/null

real 0m2.408s
user 0m2.437s
sys 0m0.037s
$ time grep -Fvf B A >/dev/null

real 0m1.208s
user 0m1.161s
sys 0m0.035s

What now?-)

Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)
and finally destroys original line
order (should it be important).

true

That's our final agreement :)

Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.

Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')?
Because the python way has been already discussed by Friedrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.

BTW, the following occurred to me:

tzot@tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot@tril/tmp
$ cat >B
ss
ff
hh
kk
tzot@tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot@tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa

Note that 'aa' is contained twice in the A file (to be filtered by B).
So our methods do not produce the same output. As far as the OP wrote:
Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.

grep is the unix way to go for both speed and correctness.

I would call this issue a dead horse.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...

Jul 18 '05 #12

sf wrote:

The point is that when you have 100,000s of records, this grep becomes
really slow?

There are performance bugs with current versions of grep
and multibyte characters that are only getting addressed now.
To work around these do `export LANG=C` first.

In my experience grep is not scalable since it's O(n^2).
See below (note A and B are randomized versions of
/usr/share/dict/words (and therefore worst case for the
sort method)).

$ wc -l A B
45427 A
45427 B

$ export LANG=C

$ time grep -Fvf B A
real 0m0.437s

$ time sort A B B | uniq -u
real 0m0.262s

$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19

--
Pádraig Brady - http://www.pixelbeat.org
--

Jul 18 '05 #13

Christos TZOTZIOY Georgiou

On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that P@draigBrady.co m
might have written:

sf:

sf wrote:
The point is that when you have 100,000s of records, this grep becomes
really slow?
There are performance bugs with current versions of grep
and multibyte characters that are only getting addressed now.
To work around these do `export LANG=C` first.

You also should use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.
In my experience grep is not scalable since it's O(n^2).
See below (note A and B are randomized versions of
/usr/share/dict/words (and therefore worst case for the
sort method)).

$ wc -l A B
45427 A
45427 B

$ export LANG=C

$ time grep -Fvf B A
real 0m0.437s

$ time sort A B B | uniq -u
real 0m0.262s

$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19

sf, you better do your own benchmarks (there is quick, sample code in
other posts of mine and Pádraig's) on your machine, since on my test
machine the numbers are reversed re to these of Pádraig's (grep takes
half the time).

package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

language:
$ echo $LANG
en_US.UTF-8

Caution: both solutions are interexchangeab le as long as you don't have
duplicate lines in the A file. If you do, use the grep version.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...

Jul 18 '05 #14

Similar topics

3292

grep

by: David Isaac | last post by:

What's the standard replacement for the obsolete grep module? Thanks, Alan Isaac

Python

1553

regd efficient methods to manipulate *large* files

by: Madhusudhanan Chandrasekaran | last post by:

Hi: This question is not directed "entirely" at python only. But since I want to know how to do it in python, I am posting here. I am constructing a huge matrix (m x n), whose columns n are stored in smaller files. Once I read m such files, my matrix is complete. I want to pass this matrix as an input to another script of mine (I just have the binary.) Currently, the script reads a file (which is nothing but the

Python

3773

efficient text file search.

by: noro | last post by:

Is there a more efficient method to find a string in a text file then: f=file('somefile') for line in f: if 'string' in line: print 'FOUND' ? BTW:

Python

4871

Can I beat perl at grep-like processing speed?

by: js | last post by:

Just my curiosity. Can python beats perl at speed of grep-like processing? $ wget http://www.gutenberg.org/files/7999/7999-h.zip $ unzip 7999-h.zip $ cd 7999-h $ cat *.htm bigfile $ du -h bigfile du -h bigfile

Python

8231

Grep Equivalent for Python

by: tereglow | last post by:

Hello all, I come from a shell/perl background and have just to learn python. To start with, I'm trying to obtain system information from a Linux server using the /proc FS. For example, in order to obtain the amount of physical memory on the server, I would do the following in shell: grep ^MemTotal /proc/meminfo | awk '{print $2}' That would get me the exact number that I need. Now, I'm trying to do

Python

2379

efficient searching methods than grep

by: Manogna | last post by:

hi ! I have two big files like A anb B. A contains nearly 1000 lines. and B contains 1500000 lines.

Perl

10131

python vs. grep

by: Anton Slesarev | last post by:

I've read great paper about generators: http://www.dabeaz.com/generators/index.html Author say that it's easy to write analog of common linux tools such as awk,grep etc. He say that performance could be even better. But I have some problem with writing performance grep analog. It's my script:

Python

3451

re.search much slower then grep on some regular expressions

by: Henning_Thornblad | last post by:

What can be the cause of the large difference between re.search and grep? This script takes about 5 min to run on my computer: #!/usr/bin/env python import re row="" for a in range(156000): row+="a"

Python

10426

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10154

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

9993

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

9029

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

7537

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6776

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

5430

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5558

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

4109

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp