Hi all,
I have two files:
- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)
# head PSP0000320.dat CBR0000319.dat
==> PSP0000320.dat <==
96653696338
96653766996
96654609431
96654722608
96654738074
96655697044
96655824738
96656190117
96656256762
96656263751
==> CBR0000319.dat <==
96651131135
96651131135
96651420412
96651730095
96652399117
96652399142
96652399142
96652399142
96652399160
96652399271
Objective: to remove the numbers present in barred-list from the
PSPfile.
$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat
$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat
I wrote the following in python to do it:
#: c01:rmcommon.py
barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')
# reading it all in one go, so as to avoid frequent disk accesses
(assume machine has plenty memory)
barredlist.read()
postlist.read()
#
for number in postlist:
if number in barrlist:
pass
else:
outfile.write(number)
barredlist.close(); postlist.close(); outfile.close()
#:~
The above code simply takes too long to complete. If I were to do a
diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
complete.
I wrote the following in bash to do the same:
#!/bin/bash
ARGS=2
if [ $# -ne $ARGS ] # takes two arguments
then
echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
echo; echo " eg.: `basename $0` PSP0000320.dat
CBR0000319.dat"; echo;
echo "NOTE: first argument: PSP file, second: CBR file";
echo " this script _does_ no_ input validation!"
exit 1
fi;
# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good
# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted
# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~
....but strangely though, there's a discrepancy, the reason for which I
can't figure out!
Needless to say, I'm utterly new to python and my programming skills &
know-how are rudimentary.
Any help will be genuinely appreciated.
--
fynali 22 1719
[fynali] I have two files:
- PSP0000320.dat (quite a large list of mobile numbers), - CBR0000319.dat (a subset of the above, a list of barred bumbers)
# print all non-barred mobile phone numbers
barred = set(open('CBR0000319.dat'))
for num in open('PSP0000320.dat'):
if num not in barred:
print num,
Raymond
On 12 Jan 2006 09:04:21 -0800
"fynali" <il******@gmail.com> wrote: Hi all,
I have two files:
- PSP0000320.dat (quite a large list of mobile numbers), - CBR0000319.dat (a subset of the above, a list of barred bumbers)
....
Objective: to remove the numbers present in barred-list from the PSPfile.
How fast does this run?
a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))
AJL
"fynali" <il******@gmail.com> writes: Hi all,
I have two files:
Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):
Objective: to remove the numbers present in barred-list from the PSPfile.
$ ls -lh PSP0000320.dat CBR0000319.dat ... 56M Dec 28 19:41 PSP0000320.dat ... 8.6M Dec 28 19:40 CBR0000319.dat
$ wc -l PSP0000320.dat CBR0000319.dat 4,462,603 PSP0000320.dat 693,585 CBR0000319.dat
I wrote the following in bash to do the same:
#!/bin/bash
ARGS=2
if [ $# -ne $ARGS ] # takes two arguments then echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}" echo; echo " eg.: `basename $0` PSP0000320.dat CBR0000319.dat"; echo; echo "NOTE: first argument: PSP file, second: CBR file"; echo " this script _does_ no_ input validation!" exit 1 fi;
# fix prefix; cost: 12.587 secs cat $1 | sed -e 's/^0*/966/' > $1.good cat $2 | sed -e 's/^0*/966/' > $2.good
# sort/save files; for the 4,462,603 lines, cost: 36.589 secs sort $1.good > $1.sorted sort $2.good > $2.sorted
# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs diff -y $1.sorted $2.sorted | grep "<" > $1.filtered
# remove trailing junk [spaces & <]; cost: 1 min 3 secs cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned
# remove intermediate files, good, sorted, filtered rm -f *.good *.sorted *.filtered #:~
...but strangely though, there's a discrepancy, the reason for which I can't figure out!
The above script can be shortened quite a bit:
#!/bin/sh
comm -23 <(sed 's/^0*/966/' $1 | sort) <(sed 's/^0*/966/ $2 | sort)
Will output only lines that occur in $1. It also runs the seds and
sorts in parallel, which can make a significant difference in the
clock time it takes to get the job done.
The Python version is probably faster, since it doesn't sort the
data.
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
On 01/12/2006-09:04AM, fynali wrote: - PSP0000320.dat (quite a large list of mobile numbers), - CBR0000319.dat (a subset of the above, a list of barred bumbers)
fgrep -x -v -f CBR0000319.dat PSP0000320.dat > PSP-CBR.dat
AJL wrote: How fast does this run?
a = set(file('PSP0000320.dat')) b = set(file('CBR0000319.dat')) file('PSP-CBR.dat', 'w').writelines(a.difference(b))
Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).
To jam the action into a couple lines, try this:
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b. __contains__,file('PSP0000320.dat')))
Raymond
On 12 Jan 2006 22:29:22 -0800
"Raymond Hettinger" <py****@rcn.com> wrote: AJL wrote: How fast does this run?
a = set(file('PSP0000320.dat')) b = set(file('CBR0000319.dat')) file('PSP-CBR.dat', 'w').writelines(a.difference(b))
Turning PSP into a set takes extra time, consumes unnecessary memory, eliminates duplicates (possibly a bad thing), and loses the original input ordering (probably a bad thing).
To jam the action into a couple lines, try this:
b = set(file('CBR0000319.dat')) file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b. __contains__,file('PSP0000320.dat')))
Raymond
The OP said "assume machine has plenty memory". ;)
I saw some solutions that used sets and was wondering why they stopped
at using a set for the first file and not the second when the problem is
really a set problem but I can see the reasoning behind it now.
AJL
$ time fgrep -x -v -f CBR0000333 PSP0000333 > PSP-CBR.dat.fgrep
real 0m31.551s
user 0m16.841s
sys 0m0.912s
--
$ time ./cleanup.py
real 0m6.080s
user 0m4.836s
sys 0m0.408s
--
$ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
3872421 PSP-CBR.dat.fgrep
3872421 PSP-CBR.dat.python
Fantastic, at any rate the time is down from my initial ~4 min.!
Thank you Chris. The fgrep approach is clean and to the point; and one
more reason to love the *nix approach to handling everyday problems.
Fredrik's set|dict approach in Python above gives me one more reason to
love Python. And it is indeed fast, 5x!
Thank you all for all your help.
--
fynali
$ cat cleanup_ray.py
#!/usr/bin/python
import itertools
b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
--
$ time ./cleanup_ray.py
real 0m5.451s
user 0m4.496s
sys 0m0.428s
(-: Damn! That saves a bit more time! Bravo!
Thanks to you Raymond.
fynali wrote: $ cat cleanup_ray.py #!/usr/bin/python import itertools
b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
-- $ time ./cleanup_ray.py
real 0m5.451s user 0m4.496s sys 0m0.428s
(-: Damn! That saves a bit more time! Bravo!
Thanks to you Raymond.
Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.
Though it would just be 1 or 2 sec difference(given what you already
have) so may not be important but could be fun.
--
$ ./cleanup.py
Traceback (most recent call last):
File "./cleanup.py", line 3, in ?
import itertools
ImportError: No module named itertools
--
$ time ./cleanup.py
File "./cleanup.py", line 8
outfile.writelines(number for number in postpaid_file if number
not in barred)
^
SyntaxError: invalid syntax
The earlier results I posted were run on my workstation which has
Python 2.4.1,
$ uname -a && python -V
Linux sajid 2.6.13-15.7-smp #1 SMP
Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
Python 2.4.1
but the server on which the actual processing will be done has an older
version )-:
$ uname -a && python -V
Linux cactus 2.4.21-20.ELsmp #1 SMP
Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
Python 2.2.3
Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?
--
fynali
[bonono] Have you tried the explicit loop variant with psyco ?
Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?
[fynali] Needless to say, I'm utterly new to python and my programming skills & know-how are rudimentary.
(-:
--
fynali
"fynali" wrote: Is a rewrite possible of Raymond's or Fredrik's suggestions above which will still give me the time saving made?
Python 2.2 don't have a readymade set type (new in 2.3), and it doesn't
support generator expressions (the thing that caused the syntax error).
however, using a dictionary instead of the set
barred = {}
for number in open(open('/home/sjd/python/wip/CBR0000319.dat')):
barred[number] = None # just add it as a key
and a list comprehension instead of the generator expression
outfile.writelines([number for number in infile if number not in barred])
(note the extra brackets)
should give you decent performance under 2.2.
</F>
fynali wrote: [bonono] Have you tried the explicit loop variant with psyco ?
Sure I wouldn't mind trying; can you suggest some code snippets along the lines of which I should try...?
[fynali] > Needless to say, I'm utterly new to python and my programming > skills & know-how are rudimentary.
(-:
-- fynali
just add :
import psyco
psyco.full()
to the beginning of the code from Fredrik for 2.2(the one using dict
and list comprehension). You system may not have the psyco module
though.
$ cat cleanup.py
#!/usr/bin/python
postpaid_file = open('/home/oracle/stc/test/PSP0000333')
outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')
barred = {}
for number in open('/home/oracle/stc/test/CBR0000333'):
barred[number] = None # just add it as a key
outfile.writelines([number for number in postpaid_file if number
not in barred])
postpaid_file.close(); outfile.close()
--
$ time ./cleanup.py
real 0m31.007s
user 0m24.660s
sys 0m3.550s
Can we say that using generators & newer Python _is_ faster?
--
fynali
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python
import psyco
psyco.full()
postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')
barred = {}
for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key
outfile.writelines([number for number in postpaid_file if number
not in barred])
postpaid_file.close(); outfile.close()
--
$ time ./cleanup_use_psyco_and_list_compr.py
real 0m39.638s
user 0m5.532s
sys 0m0.868s
This was run on my machine (w/ Python 2.4.1), can't install psyco on
the actual server at the moment.
I guess using generators & newer Python is indeed faster|better.
--
fynali
fynali wrote: $ cat cleanup_use_psyco_and_list_compr.py #!/usr/bin/python
import psyco psyco.full()
postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333') outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')
barred = {}
for number in open('/home/sajid/python/wip/stc/2/CBR0000333'): barred[number] = None # just add it as a key
outfile.writelines([number for number in postpaid_file if number not in barred])
postpaid_file.close(); outfile.close()
-- $ time ./cleanup_use_psyco_and_list_compr.py
real 0m39.638s user 0m5.532s sys 0m0.868s
This was run on my machine (w/ Python 2.4.1), can't install psyco on the actual server at the moment.
I guess using generators & newer Python is indeed faster|better.
-- fynali
um, strange, so psyco is slower than not using it ?
you may try to expand the list comprehension to :
for number in postpaid_file:
if number not in barred: outfile.writelines(number)
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python
import psyco
psyco.full()
postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')
barred = {}
for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key
for number in postpaid_file:
if number not in barred: outfile.writelines(number)
postpaid_file.close(); outfile.close()
--
$ time ./cleanup_use_psyco_and_list_compr.py
real 0m24.293s
user 0m22.633s
sys 0m0.524s
Saves ~6 secs.
--
fynali
Sorry, pls read that ~15 secs.
fynali wrote: Sorry, pls read that ~15 secs.
That is more or less about it. As set() is faster than dict(), about 2x
on my machine and I assume a portion of your time is in set/dict
creation as it is pretty large data set.
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python
#import psyco
#psyco.full()
postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')
barred = {}
for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key
for number in postpaid_file:
if number not in barred: outfile.writelines(number)
postpaid_file.close(); outfile.close()
--
$ time ./cleanup_use_psyco_and_list_compr.py
real 0m22.587s
user 0m21.653s
sys 0m0.440s
Not using psyco is faster!
--
fynali
> > b = set(file('/home/sajid/python/wip/stc/2/CBR0000333')) file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
-- $ time ./cleanup_ray.py
real 0m5.451s user 0m4.496s sys 0m0.428s
(-: Damn! That saves a bit more time! Bravo!
[bo****@gmail.com] Have you tried the explicit loop variant with psyco ? My experience is that psyco is pretty good at optimizing for loop which usually results in faster code than even built-in map/filter variant.
Though it would just be 1 or 2 sec difference(given what you already have) so may not be important but could be fun.
The code is pretty tight and is now most likely I/O bound. If so,
further speed-ups will be hard to come by (even with psyco). The four
principal steps of reading, membership testing, filtering, and writing
are all C coded methods which are directly linked together with no
interpreter loop overhead or method lookups. Hard to beat.
On 13 Jan 2006 23:17:05 -0800, bo****@gmail.com wrote: fynali wrote: $ cat cleanup_ray.py #!/usr/bin/python import itertools
b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfals e(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))
-- $ time ./cleanup_ray.py
real 0m5.451s user 0m4.496s sys 0m0.428s
(-: Damn! That saves a bit more time! Bravo!
Thanks to you Raymond. Have you tried the explicit loop variant with psyco ? My experience is that psyco is pretty good at optimizing for loop which usually results in faster code than even built-in map/filter variant.
Though it would just be 1 or 2 sec difference(given what you already have) so may not be important but could be fun.
OTOH, when you are dealing with large files and near-optimal simple processing you are
likely to be comparing i/o-bound processes, meaning differences observed
will be symptoms of os and file system performance more than of the algorithms.
An exception is when a slight variation in algorithm can cause a large change
in i/o performance, such as if it causes physical seek and read patterns of disk
access that the OS/file_system and disk interface hardware can't entirely optimize out
with smart buffering etc. Not to mention possible interactions with all the other things
an OS may be doing "simultaneously" switching between things that it accounts for as real/user/sys.
Regards,
Bengt Richter This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: John Spiegel |
last post by:
Hi all,
I'm trying to efficiently pull data out of an xml file into a XmlDocument
AND create another "sub" document based on one subtree of the document. For
example, say I've got:
<Books>...
|
by: thotashireesh |
last post by:
Hi All,
I am using a stl::map and my key is a structure which looks like
struct key
{
int id;
char* name;
};
|
by: terry |
last post by:
could someone tell me how to add or remove entity to a xml file
when i dim xmlentity as new xmlentity
it's say it's sube new is private
thks
|
by: David Buchan |
last post by:
Hi guys,
This may be a dumb question; I'm just getting into C language here.
I wrote a program to unpack a binary file and write out the contents to
a new file as a list of unsigned integers....
|
by: Ignacio X. Domínguez |
last post by:
Hi. I'm developing a desktop application that needs to store some data in a
local file. Let's say for example that I want to have an address book with
names and phone numbers in a file. I would...
| |
by: Markus |
last post by:
Hi!
I wanted to select a subset of nodes (list =
selectNodes("parent/child") from a XmlDocument, then remove all
(parentNode.removeAll();) child-nodes and insert the previous selected nodes...
|
by: BillJosephson |
last post by:
Want to do OOP. Does c++ have all the abilities of java, or is it some
subset?
Thanks...
|
by: Terry Reedy |
last post by:
Dan Stromberg wrote:
Since you do not need all 10**6 files sorted, you might also try the
heapq module. The entries into the heap would be (time, fileid)
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
| |
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The...
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
|
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...
| |