Efficient grep using Python?

Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.

Jul 18 '05 #1

Subscribe Reply

11794

John Hunter

>>>>> "sf" == sf <sf@sf.sf> writes:

sf> Just started thinking about learning python. Is there any
sf> place where I can get some free examples, especially for
sf> following kind of problem ( it must be trivial for those using
sf> python)

sf> I have files A, and B each containing say 100,000 lines (each
sf> line=one string without any space)

sf> I want to do

sf> " A - (A intersection B) "

sf> Essentially, want to do efficient grep, i..e from A remove
sf> those lines which are also present in file B.

If you're only talking about 100K lines or so, and you have a
reasonably modern computer, you can do this all in memory. If order
doesn't matter (it probably does) you can use a set to get all the
lines in file B that are not in A

from sets import Set
A = Set(file('test1 .dat').readline s())
B = Set(file('test2 .dat').readline s())
print B-A

To preserve order, you should use a dictionary that maps lines to line
numbers. You can later use these numbers to sort

A = dict([(line, num) for num,line in enumerate(file( 'test1.dat'))])
B = dict([(line, num) for num,line in enumerate(file( 'test2.dat'))])

keep = [(num, line) for line,num in B.items() if not A.has_key(line)]
keep.sort()
for num, line in keep:
print line,

Now someone else will come along and tell you all this functionality
is already in the standard library. But it's always fun to hack this
out yourself once because python makes such things so damned easy.

JDH

Jul 18 '05 #2

Fredrik Lundh

"sf" <sf@sf.sf> wrote:

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.

that's an unusual definition of "grep", but the following seems to
do what you want:

afile = "a.txt"
bfile = "b.txt"

bdict = dict.fromkeys(o pen(bfile).read lines())

for line in open(afile):
if line not in bdict:
print line,

</F>

Jul 18 '05 #3

sf wrote:

Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.

You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u

--
Pádraig Brady - http://www.pixelbeat.org
--

Jul 18 '05 #4

Tim Peters

["sf" <sf@sf.sf>]

I have files A, and B each containing say 100,000 lines (each
line=one string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i..e from A remove those
lines which are also present in file B.

[Fredrik Lundh] that's an unusual definition of "grep", but the following seems to
do what you want:

afile = "a.txt"
bfile = "b.txt"

bdict = dict.fromkeys(o pen(bfile).read lines())

for line in open(afile):
if line not in bdict:
print line,

</F>

Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(o pen(bfile))

is good enough (there's no need for the .readlines()).

Jul 18 '05 #5

Fredrik Lundh

Tim Peters wrote:

bdict = dict.fromkeys(o pen(bfile).read lines())

for line in open(afile):
if line not in bdict:
print line,

</F>

Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(o pen(bfile))

is good enough (there's no need for the .readlines()).

(sigh. my brain knows that, but my fingers keep forgetting)

and yes, for this purpose, "dict.fromk eys" can be replaced
with "set".

bdict = set(open(bfile) )

(and then you can save a few more bytes by renaming the
variable...)

</F>

Jul 18 '05 #6

Tim Peters

[Fredrik Lundh]

bdict = dict.fromkeys(o pen(bfile).read lines())

for line in open(afile):
if line not in bdict:
print line,

</F>

[Tim Peters] Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(o pen(bfile))

is good enough (there's no need for the .readlines()).

[/F] (sigh. my brain knows that, but my fingers keep forgetting)

and yes, for this purpose, "dict.fromk eys" can be replaced
with "set".

bdict = set(open(bfile) )

(and then you can save a few more bytes by renaming the
variable...)

Except the latter two are just shallow spelling changes. Switching
from fromkeys(open(f ).readlines()) to fromkeys(open(f )) is much more
interesting, since it can allow major reduction in memory use. Even
if all the lines in the file are pairwise distinct, not materializing
them into a giant list can be a significant win. I wouldn't have
bothered replying if the only point were that you can save a couple
bytes of typing <wink>.

Jul 18 '05 #7

Christos TZOTZIOY Georgiou wrote:

On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that P@draigBrady.co m
might have written:

Essentiall y, want to do efficient grep, i..e from A remove those lines which
are also present in file B.
You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u

No, like I just wrote in another post, he wants

$ grep -vf B A

I think that

$ sort A B B | uniq -u

can be abbreviated to

$ sort -u A B B

which is the union rather than the intersection of the files

wrong. Notice the -u option to uniq.
http://marc.free.net.ph/message/2002....1bc24964.html
wastes some time by considering B twice
I challenge you to a benchmark :-)
and finally destroys original line
order (should it be important).

true

--
Pádraig Brady - http://www.pixelbeat.org
--

Jul 18 '05 #8

Christos TZOTZIOY Georgiou

On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P@draigBrady.co m
might have written:

[sf]

Essentially , want to do efficient grep, i..e from A remove those lines which
are also present in file B.
[p@draig]You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u
[Christos]
No, like I just wrote in another post, he wants

$ grep -vf B A

I think that

$ sort A B B | uniq -u

can be abbreviated to

$ sort -u A B B

which is the union rather than the intersection of the files
[P@draig]
wrong. Notice the -u option to uniq.
http://marc.free.net.ph/message/2002....1bc24964.html

I see your point. That's a new to me use of uniq, since I started using
Unices long before GNU versions of the tools, but then, I might have
missed the -u option.

$ cat A
aa
ss
dd
ff
gg
hh
jj
kk
$ cat B
aa
dd
gg
jj
pp
$ time sort A B B | uniq -u
ff
hh
kk
ss

real 0m0.004s
user 0m0.004s
sys 0m0.004s
$ time grep -vf B A
ss
ff
hh
kk

real 0m0.003s
user 0m0.000s
sys 0m0.003s

So I stand corrected that your solution does *not* give the union.
wastes some time by considering B twice

I challenge you to a benchmark :-)

Well, the numbers I provided above are almost meaningless with such a
small set (and they easily could be reverse, I just kept the
convenient-to-me first run :). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright" , "credits" or "license" for more information. x=open('/usr/share/dict/words').readlin es()
len(x) 45378 import random
random.shuffle( x)
open("/tmp/A", "w").writelines (x)
random.shuffle( x)
open("/tmp/B", "w").writelines (x[:1000])
$ time sort A B B | uniq -u >/dev/null

real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null

real 0m0.067s
user 0m0.064s
sys 0m0.003s

(Yes, I cheated by adding the F (for no regular expressions) flag :)
and finally destroys original line
order (should it be important).

true

That's our final agreement :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...

Jul 18 '05 #9

Christos TZOTZIOY Georgiou wrote:

On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P@draigBrady.co m
I challenge you to a benchmark :-)

Well, the numbers I provided above are almost meaningless with such a
small set (and they easily could be reverse, I just kept the
convenient-to-me first run :). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright" , "credits" or "license" for more information.
x=open('/usr/share/dict/words').readlin es()
len(x)
45378
import random
random.shuf fle(x)
open("/tmp/A", "w").writelines (x)
random.shuf fle(x)
open("/tmp/B", "w").writelines (x[:1000])

$ time sort A B B | uniq -u >/dev/null

real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null

real 0m0.067s
user 0m0.064s
sys 0m0.003s

(Yes, I cheated by adding the F (for no regular expressions) flag :)

Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!

and finally destroys original line
order (should it be important).

true

That's our final agreement :)

Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.

--
Pádraig Brady - http://www.pixelbeat.org
--

Jul 18 '05 #10

Similar topics

3290

grep

by: David Isaac | last post by:

What's the standard replacement for the obsolete grep module? Thanks, Alan Isaac

Python

1553

regd efficient methods to manipulate *large* files

by: Madhusudhanan Chandrasekaran | last post by:

Hi: This question is not directed "entirely" at python only. But since I want to know how to do it in python, I am posting here. I am constructing a huge matrix (m x n), whose columns n are stored in smaller files. Once I read m such files, my matrix is complete. I want to pass this matrix as an input to another script of mine (I just have the binary.) Currently, the script reads a file (which is nothing but the

Python

3772

efficient text file search.

by: noro | last post by:

Is there a more efficient method to find a string in a text file then: f=file('somefile') for line in f: if 'string' in line: print 'FOUND' ? BTW:

Python

4869

Can I beat perl at grep-like processing speed?

by: js | last post by:

Just my curiosity. Can python beats perl at speed of grep-like processing? $ wget http://www.gutenberg.org/files/7999/7999-h.zip $ unzip 7999-h.zip $ cd 7999-h $ cat *.htm bigfile $ du -h bigfile du -h bigfile

Python

8230

Grep Equivalent for Python

by: tereglow | last post by:

Hello all, I come from a shell/perl background and have just to learn python. To start with, I'm trying to obtain system information from a Linux server using the /proc FS. For example, in order to obtain the amount of physical memory on the server, I would do the following in shell: grep ^MemTotal /proc/meminfo | awk '{print $2}' That would get me the exact number that I need. Now, I'm trying to do

Python

2379

efficient searching methods than grep

by: Manogna | last post by:

hi ! I have two big files like A anb B. A contains nearly 1000 lines. and B contains 1500000 lines.

Perl

10130

python vs. grep

by: Anton Slesarev | last post by:

I've read great paper about generators: http://www.dabeaz.com/generators/index.html Author say that it's easy to write analog of common linux tools such as awk,grep etc. He say that performance could be even better. But I have some problem with writing performance grep analog. It's my script:

Python

3450

re.search much slower then grep on some regular expressions

by: Henning_Thornblad | last post by:

What can be the cause of the large difference between re.search and grep? This script takes about 5 min to run on my computer: #!/usr/bin/env python import re row="" for a in range(156000): row+="a"

Python

9639

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

10308

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10143

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10076

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

9939

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

7486

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6729

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4040

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

2870

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General