Bytes | Software Development & Data Engineering Community

Efficient grep using Python?

sf
Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, I want to do an efficient grep, i.e. from A remove those lines
which are also present in file B.
Jul 18 '05 #1
>>>>> "sf" == sf <sf@sf.sf> writes:

sf> Just started thinking about learning python. Is there any
sf> place where I can get some free examples, especially for
sf> following kind of problem ( it must be trivial for those using
sf> python)

sf> I have files A, and B each containing say 100,000 lines (each
sf> line=one string without any space)

sf> I want to do

sf> " A - (A intersection B) "

sf> Essentially, want to do efficient grep, i.e. from A remove
sf> those lines which are also present in file B.

If you're only talking about 100K lines or so, and you have a
reasonably modern computer, you can do this all in memory. If order
doesn't matter (it probably does), you can use a set to get all the
lines in file A that are not in B:

from sets import Set
A = Set(file('test1.dat').readlines())
B = Set(file('test2.dat').readlines())
print A-B

To preserve order, you should use a dictionary that maps lines to line
numbers; you can later use these numbers to sort:

A = dict([(line, num) for num, line in enumerate(file('test1.dat'))])
B = dict([(line, num) for num, line in enumerate(file('test2.dat'))])

keep = [(num, line) for line, num in A.items() if line not in B]
keep.sort()
for num, line in keep:
    print line,

Now someone else will come along and tell you all this functionality
is already in the standard library. But it's always fun to hack this
out yourself once because python makes such things so damned easy.
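For reference, in modern Python (3.x) the same idea needs no `sets` module: `set` is a builtin and a file object iterates over its lines. A minimal sketch of the order-preserving variant (file names here are placeholders):

```python
def a_minus_b(a_path, b_path):
    # Load B's lines into a set (newlines included), then stream A
    # and keep only the lines absent from B, in A's original order.
    with open(b_path) as f:
        b_lines = set(f)
    with open(a_path) as f:
        return [line for line in f if line not in b_lines]
```

Because A is streamed line by line rather than read with readlines(), only B has to fit in memory.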

JDH

Jul 18 '05 #2
"sf" <sf@sf.sf> wrote:
I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.


that's an unusual definition of "grep", but the following seems to
do what you want:

afile = "a.txt"
bfile = "b.txt"

bdict = dict.fromkeys(open(bfile).readlines())

for line in open(afile):
    if line not in bdict:
        print line,

</F>

Jul 18 '05 #3
sf wrote:
Just started thinking about learning python.

Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)

I have files A, and B each containing say 100,000 lines (each line=one
string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.


You could implement this elegantly using the new sets feature.
For reference, here is the unix way to do it:

sort a b b | uniq -u
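Why this works: listing B twice guarantees every line of B occurs at least twice in the merged, sorted input, so `uniq -u` (print only non-repeated lines) drops all of B plus any line of A that also appears in B. What survives are A's lines that occur exactly once and are not in B. A hedged Python sketch of the same counting logic (function name is made up for illustration):

```python
from collections import Counter

def sort_abb_uniq_u(a_lines, b_lines):
    # Count each line across A, B, B; a line survives `uniq -u` only
    # if it occurs exactly once, i.e. it is a non-duplicated line of A
    # that does not appear in B.  Output comes back sorted, as from sort(1).
    counts = Counter(a_lines)
    for line in b_lines:
        counts[line] += 2          # B is listed twice on the command line
    return sorted(line for line, n in counts.items() if n == 1)
```

Note the caveat built into the counting: a line that is duplicated *within A* also occurs more than once and is dropped, which is where this differs from a plain grep.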

--
Pádraig Brady - http://www.pixelbeat.org
--
Jul 18 '05 #4
["sf" <sf@sf.sf>]
I have files A, and B each containing say 100,000 lines (each
line=one string without any space)

I want to do

" A - (A intersection B) "

Essentially, want to do efficient grep, i.e. from A remove those
lines which are also present in file B.

[Fredrik Lundh] that's an unusual definition of "grep", but the following seems to
do what you want:

afile = "a.txt"
bfile = "b.txt"

bdict = dict.fromkeys(open(bfile).readlines())

for line in open(afile):
    if line not in bdict:
        print line,

</F>


Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(open(bfile))

is good enough (there's no need for the .readlines()).
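The iterability Tim describes can be seen directly; this self-contained snippet (the file path is just for the demo) shows fromkeys() consuming a file object one line at a time:

```python
# Demonstrate that an open file is an iterable of lines: fromkeys()
# can consume it directly, no .readlines() needed.
with open('/tmp/_b_demo.txt', 'w') as f:
    f.write('aa\nbb\ncc\n')

with open('/tmp/_b_demo.txt') as f:
    bdict = dict.fromkeys(f)      # keys are the lines, values are None

print(sorted(bdict))              # ['aa\n', 'bb\n', 'cc\n']
```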
Jul 18 '05 #5
Tim Peters wrote:
bdict = dict.fromkeys(open(bfile).readlines())

for line in open(afile):
    if line not in bdict:
        print line,

</F>


Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(open(bfile))

is good enough (there's no need for the .readlines()).


(sigh. my brain knows that, but my fingers keep forgetting)

and yes, for this purpose, "dict.fromkeys" can be replaced
with "set".

bdict = set(open(bfile))

(and then you can save a few more bytes by renaming the
variable...)

</F>

Jul 18 '05 #6
[Fredrik Lundh]
bdict = dict.fromkeys(open(bfile).readlines())

for line in open(afile):
    if line not in bdict:
        print line,

</F>

[Tim Peters] Note that an open file is an iterable object, yielding the lines in
the file. The "for" loop exploited that above, but fromkeys() can
also exploit it. That is,

bdict = dict.fromkeys(open(bfile))

is good enough (there's no need for the .readlines()).

[/F] (sigh. my brain knows that, but my fingers keep forgetting)

and yes, for this purpose, "dict.fromkeys" can be replaced
with "set".

bdict = set(open(bfile))

(and then you can save a few more bytes by renaming the
variable...)


Except the latter two are just shallow spelling changes. Switching
from fromkeys(open(f).readlines()) to fromkeys(open(f)) is much more
interesting, since it can allow major reduction in memory use. Even
if all the lines in the file are pairwise distinct, not materializing
them into a giant list can be a significant win. I wouldn't have
bothered replying if the only point were that you can save a couple
bytes of typing <wink>.
Jul 18 '05 #7
Christos TZOTZIOY Georgiou wrote:
On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that P@draigBrady.com
might have written:

Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.
You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u

No, like I just wrote in another post, he wants

$ grep -vf B A

I think that

$ sort A B B | uniq -u

can be abbreviated to

$ sort -u A B B

which is the union rather than the intersection of the files


wrong. Notice the -u option to uniq.
http://marc.free.net.ph/message/2002....1bc24964.html
wastes some time by considering B twice
I challenge you to a benchmark :-)
and finally destroys original line
order (should it be important).


true

--
Pádraig Brady - http://www.pixelbeat.org
--
Jul 18 '05 #8
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P@draigBrady.com
might have written:

[sf]
Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.
[p@draig]You could implement elegantly using the new sets feature
For reference here is the unix way to do it:

sort a b b | uniq -u
[Christos]
No, like I just wrote in another post, he wants

$ grep -vf B A

I think that

$ sort A B B | uniq -u

can be abbreviated to

$ sort -u A B B

which is the union rather than the intersection of the files
[P@draig]
wrong. Notice the -u option to uniq.
http://marc.free.net.ph/message/2002....1bc24964.html


I see your point. That's a new-to-me use of uniq; I started using
Unices long before the GNU versions of the tools, so I might have
missed the -u option.

$ cat A
aa
ss
dd
ff
gg
hh
jj
kk
$ cat B
aa
dd
gg
jj
pp
$ time sort A B B | uniq -u
ff
hh
kk
ss

real 0m0.004s
user 0m0.004s
sys 0m0.004s
$ time grep -vf B A
ss
ff
hh
kk

real 0m0.003s
user 0m0.000s
sys 0m0.003s

So I stand corrected: your solution does *not* give the union.
wastes some time by considering B twice


I challenge you to a benchmark :-)


Well, the numbers I provided above are almost meaningless with such a
small set (and they could easily be reversed; I just kept the
convenient-to-me first run :). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = open('/usr/share/dict/words').readlines()
>>> len(x)
45378
>>> import random
>>> random.shuffle(x)
>>> open("/tmp/A", "w").writelines(x)
>>> random.shuffle(x)
>>> open("/tmp/B", "w").writelines(x[:1000])
$ time sort A B B | uniq -u >/dev/null

real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null

real 0m0.067s
user 0m0.064s
sys 0m0.003s

(Yes, I cheated by adding the F (for no regular expressions) flag :)
and finally destroys original line
order (should it be important).


true


That's our final agreement :)
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #9
Christos TZOTZIOY Georgiou wrote:
On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that P@draigBrady.com
I challenge you to a benchmark :-)

Well, the numbers I provided above are almost meaningless with such a
small set (and they could easily be reversed; I just kept the
convenient-to-me first run :). Do you really believe that sorting three
files and then scanning their merged output counting duplicates is
faster than scanning two files (and doing lookups during the second
scan)?

$ python
Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
x=open('/usr/share/dict/words').readlines()
len(x)
45378
import random
random.shuffle(x)
open("/tmp/A", "w").writelines(x)
random.shuffle(x)
open("/tmp/B", "w").writelines(x[:1000])

$ time sort A B B | uniq -u >/dev/null

real 0m0.311s
user 0m0.315s
sys 0m0.008s
$ time grep -Fvf B A >/dev/null

real 0m0.067s
user 0m0.064s
sys 0m0.003s

(Yes, I cheated by adding the F (for no regular expressions) flag :)


Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!
and finally destroys original line
order (should it be important).


true


That's our final agreement :)


Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.
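A sketch of that DSU idiom (helper name made up for illustration): decorate each surviving line with its first position in A, sort on the decoration, then strip it:

```python
def set_difference_in_order(a_lines, b_lines):
    # Decorate: remember each line's first position in A.
    pos = {}
    for i, line in enumerate(a_lines):
        pos.setdefault(line, i)
    # Order-destroying set difference...
    survivors = set(a_lines) - set(b_lines)
    # ...then sort on the position decoration and undecorate.
    return [line for _, line in sorted((pos[line], line) for line in survivors)]
```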

--
Pádraig Brady - http://www.pixelbeat.org
--
Jul 18 '05 #10
sf
The point is that when you have 100,000s of records, this grep becomes
really slow.

Any comments?

That's why I looked for python :)

that would be

grep -vf B A

and it is a rare use of grep, indeed.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...

Jul 18 '05 #11
On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that P@draigBrady.com
might have written:

[snip some damn lie aka "benchmark"]

[me]
(Yes, I cheated by adding the F (for no regular expressions) flag :)
Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!


Well, that's the closest I can do:

$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> alist = [line.strip() for line in open('/usr/share/dict/words')]
>>> words = set()
>>> for word in alist:
...     words.add(word + '\n')
...     words.add(word[::-1] + '\n')
...
>>> len(words)
90525
>>> words = list(words)
>>> open('/tmp/A', 'w').writelines(words)
>>> import random; random.shuffle(words)
>>> open('/tmp/B', 'w').writelines(words[:90000])
$ time sort A B B | uniq -u >/dev/null

real 0m2.408s
user 0m2.437s
sys 0m0.037s
$ time grep -Fvf B A >/dev/null

real 0m1.208s
user 0m1.161s
sys 0m0.035s

What now?-)

Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)
and finally destroys original line
order (should it be important).

true


That's our final agreement :)


Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.


Using python or unix tools (e.g. 'paste -d', 'sort -k', 'cut -d')?
Because the python way has already been discussed by Fredrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.

BTW, the following occurred to me:

tzot@tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot@tril/tmp
$ cat >B
ss
ff
hh
kk
tzot@tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot@tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa

Note that 'aa' is contained twice in the A file (to be filtered by B).
So our methods do not produce the same output. As far as the OP wrote:
Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.


grep is the unix way to go for both speed and correctness.

I would call this issue a dead horse.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #12
sf wrote:
The point is that when you have 100,000s of records, this grep becomes
really slow.


There are performance bugs with current versions of grep
and multibyte characters that are only getting addressed now.
To work around these do `export LANG=C` first.

In my experience grep is not scalable here, since it's effectively
O(n^2): each line of A is matched against every pattern from B.
See below (note A and B are randomized versions of
/usr/share/dict/words (and therefore worst case for the
sort method)).

$ wc -l A B
45427 A
45427 B

$ export LANG=C

$ time grep -Fvf B A
real 0m0.437s

$ time sort A B B | uniq -u
real 0m0.262s

$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19

--
Pádraig Brady - http://www.pixelbeat.org
--
Jul 18 '05 #13
On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that P@draigBrady.com
might have written:

sf:
sf wrote:
The point is that when you have 100,000s of records, this grep becomes
really slow.
There are performance bugs with current versions of grep
and multibyte characters that are only getting addressed now.
To work around these do `export LANG=C` first.


You also should use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.
In my experience grep is not scalable since it's O(n^2).
See below (note A and B are randomized versions of
/usr/share/dict/words (and therefore worst case for the
sort method)).

$ wc -l A B
45427 A
45427 B

$ export LANG=C

$ time grep -Fvf B A
real 0m0.437s

$ time sort A B B | uniq -u
real 0m0.262s

$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19


sf, you had better do your own benchmarks (there is quick sample code
in other posts of mine and Pádraig's) on your machine, since on my test
machine the numbers are the reverse of Pádraig's (grep takes half the
time).

package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

language:
$ echo $LANG
en_US.UTF-8

Caution: the two solutions are interchangeable only as long as you don't
have duplicate lines in the A file. If you do, use the grep version.
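That caution can be checked in Python. Using the thread's earlier example data (A with the duplicated 'aa' line), a set-based filter reproduces grep's behaviour, while the counting behind `sort A B B | uniq -u` drops the duplicated line; this is only a sketch of the two semantics, not either tool's implementation:

```python
from collections import Counter

A = ['aa', 'ss', 'dd', 'ff', 'gg', 'hh', 'jj', 'kk', 'll', 'aa']
B = ['ss', 'ff', 'hh', 'kk']

# grep -Fvf B A: every line of A not in B survives, duplicates included.
b_set = set(B)
grep_style = [line for line in A if line not in b_set]

# sort A B B | uniq -u: a line duplicated within A occurs more than once
# in the merged input, so it is dropped even though it never appears in B.
counts = Counter(A + B + B)
uniq_style = sorted(line for line, n in counts.items() if n == 1)

print(grep_style)   # ['aa', 'dd', 'gg', 'jj', 'll', 'aa']
print(uniq_style)   # ['dd', 'gg', 'jj', 'll']
```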
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
Jul 18 '05 #14

