473,513 Members | 2,334 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

How to remove subset from a file efficiently?

Hi all,

I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)

# head PSP0000320.dat CBR0000319.dat
==> PSP0000320.dat <==
96653696338
96653766996
96654609431
96654722608
96654738074
96655697044
96655824738
96656190117
96656256762
96656263751

==> CBR0000319.dat <==
96651131135
96651131135
96651420412
96651730095
96652399117
96652399142
96652399142
96652399142
96652399160
96652399271

Objective: to remove the numbers present in barred-list from the
PSPfile.

$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat

$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat

I wrote the following in python to do it:

#: c01:rmcommon.py
barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

# reading it all in one go, so as to avoid frequent disk accesses
(assume machine has plenty memory)
barredlist.read()
postlist.read()

#
for number in postlist:
if number in barrlist:
pass
else:
outfile.write(number)

barredlist.close(); postlist.close(); outfile.close()
#:~

The above code simply takes too long to complete. If I were to do a
diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
complete.

I wrote the following in bash to do the same:

#!/bin/bash

ARGS=2

if [ $# -ne $ARGS ] # takes two arguments
then
echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
echo; echo " eg.: `basename $0` PSP0000320.dat
CBR0000319.dat"; echo;
echo "NOTE: first argument: PSP file, second: CBR file";
echo " this script _does_ no_ input validation!"
exit 1
fi;

# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good

# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted

# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~

....but strangely though, there's a discrepancy, the reason for which I
can't figure out!

Needless to say, I'm utterly new to python and my programming skills &
know-how are rudimentary.

Any help will be genuinely appreciated.

--
fynali

Jan 12 '06 #1
22 1719
[fynali]
I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)


# print all non-barred mobile phone numbers
barred = set(open('CBR0000319.dat'))
for num in open('PSP0000320.dat'):
if num not in barred:
print num,

Raymond

Jan 12 '06 #2
AJL
On 12 Jan 2006 09:04:21 -0800
"fynali" <il******@gmail.com> wrote:
Hi all,

I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)
....
Objective: to remove the numbers present in barred-list from the
PSPfile.


How fast does this run?

a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))

AJL
Jan 12 '06 #3
"fynali" <il******@gmail.com> writes:
Hi all,

I have two files:
Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):
Objective: to remove the numbers present in barred-list from the
PSPfile.

$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat

$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat

I wrote the following in bash to do the same:

#!/bin/bash

ARGS=2

if [ $# -ne $ARGS ] # takes two arguments
then
echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
echo; echo " eg.: `basename $0` PSP0000320.dat
CBR0000319.dat"; echo;
echo "NOTE: first argument: PSP file, second: CBR file";
echo " this script _does_ no_ input validation!"
exit 1
fi;

# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good

# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted

# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~

...but strangely though, there's a discrepancy, the reason for which I
can't figure out!


The above script can be shortened quite a bit:

#!/bin/sh

comm -23 <(sed 's/^0*/966/' $1 | sort) <(sed 's/^0*/966/ $2 | sort)

Will output only lines that occur in $1. It also runs the seds and
sorts in parallel, which can make a significant difference in the
clock time it takes to get the job done.

The Python version is probably faster, since it doesn't sort the
data.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jan 12 '06 #4
On 01/12/2006-09:04AM, fynali wrote:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)


fgrep -x -v -f CBR0000319.dat PSP0000320.dat > PSP-CBR.dat

Jan 13 '06 #5
AJL wrote:
How fast does this run?

a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))


Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).

To jam the action into a couple lines, try this:

b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b. __contains__,file('PSP0000320.dat')))

Raymond

Jan 13 '06 #6
AJL
On 12 Jan 2006 22:29:22 -0800
"Raymond Hettinger" <py****@rcn.com> wrote:
AJL wrote:
How fast does this run?

a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))


Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).

To jam the action into a couple lines, try this:

b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b. __contains__,file('PSP0000320.dat')))

Raymond


The OP said "assume machine has plenty memory". ;)

I saw some solutions that used sets and was wondering why they stopped
at using a set for the first file and not the second when the problem is
really a set problem but I can see the reasoning behind it now.

AJL
Jan 13 '06 #7
$ time fgrep -x -v -f CBR0000333 PSP0000333 > PSP-CBR.dat.fgrep

real 0m31.551s
user 0m16.841s
sys 0m0.912s

--
$ time ./cleanup.py

real 0m6.080s
user 0m4.836s
sys 0m0.408s

--
$ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
3872421 PSP-CBR.dat.fgrep
3872421 PSP-CBR.dat.python

Fantastic, at any rate the time is down from my initial ~4 min.!

Thank you Chris. The fgrep approach is clean and to the point; and one
more reason to love the *nix approach to handling everyday problems.

Fredrik's set|dict approach in Python above gives me one more reason to
love Python. And it is indeed fast, 5x!

Thank you all for all your help.

--
fynali

Jan 14 '06 #8
$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.

Jan 14 '06 #9

fynali wrote:
$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.

Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.

Though it would just be 1 or 2 sec difference(given what you already
have) so may not be important but could be fun.

Jan 14 '06 #10
--
$ ./cleanup.py
Traceback (most recent call last):
File "./cleanup.py", line 3, in ?
import itertools
ImportError: No module named itertools

--
$ time ./cleanup.py
File "./cleanup.py", line 8
outfile.writelines(number for number in postpaid_file if number
not in barred)
^
SyntaxError: invalid syntax

The earlier results I posted were run on my workstation which has
Python 2.4.1,

$ uname -a && python -V
Linux sajid 2.6.13-15.7-smp #1 SMP
Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
Python 2.4.1

but the server on which the actual processing will be done has an older
version )-:

$ uname -a && python -V
Linux cactus 2.4.21-20.ELsmp #1 SMP
Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
Python 2.2.3

Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?

--
fynali

Jan 14 '06 #11
[bonono]
Have you tried the explicit loop variant with psyco ?
Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali] Needless to say, I'm utterly new to python and my programming
skills & know-how are rudimentary.


(-:

--
fynali

Jan 14 '06 #12
"fynali" wrote:
Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?


Python 2.2 don't have a readymade set type (new in 2.3), and it doesn't
support generator expressions (the thing that caused the syntax error).

however, using a dictionary instead of the set

barred = {}
for number in open(open('/home/sjd/python/wip/CBR0000319.dat')):
barred[number] = None # just add it as a key

and a list comprehension instead of the generator expression

outfile.writelines([number for number in infile if number not in barred])

(note the extra brackets)

should give you decent performance under 2.2.

</F>

Jan 14 '06 #13

fynali wrote:
[bonono]
Have you tried the explicit loop variant with psyco ?


Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali]
> Needless to say, I'm utterly new to python and my programming
> skills & know-how are rudimentary.


(-:

--
fynali


just add :

import psyco
psyco.full()

to the beginning of the code from Fredrik for 2.2(the one using dict
and list comprehension). You system may not have the psyco module
though.

Jan 14 '06 #14
$ cat cleanup.py
#!/usr/bin/python

postpaid_file = open('/home/oracle/stc/test/PSP0000333')
outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')

barred = {}

for number in open('/home/oracle/stc/test/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup.py

real 0m31.007s
user 0m24.660s
sys 0m3.550s

Can we say that using generators & newer Python _is_ faster?

--
fynali

Jan 14 '06 #15
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m39.638s
user 0m5.532s
sys 0m0.868s

This was run on my machine (w/ Python 2.4.1), can't install psyco on
the actual server at the moment.

I guess using generators & newer Python is indeed faster|better.

--
fynali

Jan 14 '06 #16

fynali wrote:
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m39.638s
user 0m5.532s
sys 0m0.868s

This was run on my machine (w/ Python 2.4.1), can't install psyco on
the actual server at the moment.

I guess using generators & newer Python is indeed faster|better.

--
fynali

um, strange, so psyco is slower than not using it ?

you may try to expand the list comprehension to :

for number in postpaid_file:
if number not in barred: outfile.writelines(number)

Jan 14 '06 #17
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

for number in postpaid_file:
if number not in barred: outfile.writelines(number)

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m24.293s
user 0m22.633s
sys 0m0.524s

Saves ~6 secs.

--
fynali

Jan 14 '06 #18
Sorry, pls read that ~15 secs.

Jan 14 '06 #19

fynali wrote:
Sorry, pls read that ~15 secs.

That is more or less about it. As set() is faster than dict(), about 2x
on my machine and I assume a portion of your time is in set/dict
creation as it is pretty large data set.

Jan 14 '06 #20
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

#import psyco
#psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

for number in postpaid_file:
if number not in barred: outfile.writelines(number)

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m22.587s
user 0m21.653s
sys 0m0.440s

Not using psyco is faster!

--
fynali

Jan 14 '06 #21
> > b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

[bo****@gmail.com] Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.

Though it would just be 1 or 2 sec difference(given what you already
have) so may not be important but could be fun.


The code is pretty tight and is now most likely I/O bound. If so,
further speed-ups will be hard to come by (even with psyco). The four
principal steps of reading, membership testing, filtering, and writing
are all C coded methods which are directly linked together with no
interpreter loop overhead or method lookups. Hard to beat.

Jan 14 '06 #22
On 13 Jan 2006 23:17:05 -0800, bo****@gmail.com wrote:

fynali wrote:
$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.

Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.

Though it would just be 1 or 2 sec difference(given what you already
have) so may not be important but could be fun.

OTOH, when you are dealing with large files and near-optimal simple processing you are
likely to be comparing i/o-bound processes, meaning differences observed
will be symptoms of os and file system performance more than of the algorithms.

An exception is when a slight variation in algorithm can cause a large change
in i/o performance, such as if it causes physical seek and read patterns of disk
access that the OS/file_system and disk interface hardware can't entirely optimize out
with smart buffering etc. Not to mention possible interactions with all the other things
an OS may be doing "simultaneously" switching between things that it accounts for as real/user/sys.

Regards,
Bengt Richter
Jan 14 '06 #23

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1198
by: John Spiegel | last post by:
Hi all, I'm trying to efficiently pull data out of an xml file into a XmlDocument AND create another "sub" document based on one subtree of the document. For example, say I've got: <Books>...
2
4317
by: thotashireesh | last post by:
Hi All, I am using a stl::map and my key is a structure which looks like struct key { int id; char* name; };
4
7003
by: terry | last post by:
could someone tell me how to add or remove entity to a xml file when i dim xmlentity as new xmlentity it's say it's sube new is private thks
18
3003
by: David Buchan | last post by:
Hi guys, This may be a dumb question; I'm just getting into C language here. I wrote a program to unpack a binary file and write out the contents to a new file as a list of unsigned integers....
11
2092
by: Ignacio X. Domínguez | last post by:
Hi. I'm developing a desktop application that needs to store some data in a local file. Let's say for example that I want to have an address book with names and phone numbers in a file. I would...
3
2640
by: Markus | last post by:
Hi! I wanted to select a subset of nodes (list = selectNodes("parent/child") from a XmlDocument, then remove all (parentNode.removeAll();) child-nodes and insert the previous selected nodes...
148
5039
by: BillJosephson | last post by:
Want to do OOP. Does c++ have all the abilities of java, or is it some subset? Thanks...
1
1070
by: Terry Reedy | last post by:
Dan Stromberg wrote: Since you do not need all 10**6 files sorted, you might also try the heapq module. The entries into the heap would be (time, fileid)
0
7265
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7171
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
1
7111
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
7539
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
5095
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
4751
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
0
3240
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
1
807
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
461
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.