Just started thinking about learning python.
Is there any place where I can get some free examples, especially for
following kind of problem ( it must be trivial for those using python)
I have files A, and B each containing say 100,000 lines (each line=one
string without any space)
I want to do
" A - (A intersection B) "
Essentially, want to do efficient grep, i..e from A remove those lines which
are also present in file B.
Jul 18 '05
13 11796
The point is that when you have 100,000s of records, this grep becomes
really slow?
Any comments?
Thats why I looked for python :) that would be
grep -vf B A
and it is a rare use of grep, indeed. -- TZOTZIOY, I speak England very best. "Be strict when sending and tolerant when receiving." (from RFC1958) I really should keep that in mind when talking with people, actually...
On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that P@draigBrady.co m
might have written:
[snip some damn lie aka "benchmark"]
[me] (Yes, I cheated by adding the F (for no regular expressions) flag :) Also you only have 1000 entries in B! Try it again with all entries in B also ;-) Remember the original poster had 100K entries!
Well, that's the closest I can do:
$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright" , "credits" or "license" for more information. import sys; sys.ps1='.>>'
..>> alist=[line.strip() for line in open('/usr/share/dict/words')]
..>> words=set()
..>> for word in alist:
.... words.add(word + '\n')
.... words.add(word[::-1] + '\n')
....
..>> len(words)
90525
..>> words=list(word s)
..>> open('/tmp/A', 'w').writelines (words)
..>> import random; random.shuffle( words)
..>> open('/tmp/B', 'w').writelines (words[:90000])
..>>
$ time sort A B B | uniq -u >/dev/null
real 0m2.408s
user 0m2.437s
sys 0m0.037s
$ time grep -Fvf B A >/dev/null
real 0m1.208s
user 0m1.161s
sys 0m0.035s
What now?-)
Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)
and finally destroys original line order (should it be important).
true
That's our final agreement :)
Note the order is trivial to restore with a "decorate-sort-undecorate" idiom.
Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')?
Because the python way has been already discussed by Friedrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.
BTW, the following occurred to me:
tzot@tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot@tril/tmp
$ cat >B
ss
ff
hh
kk
tzot@tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot@tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa
Note that 'aa' is contained twice in the A file (to be filtered by B).
So our methods do not produce the same output. As far as the OP wrote:
Essentially, want to do efficient grep, i..e from A remove those lines which are also present in file B.
grep is the unix way to go for both speed and correctness.
I would call this issue a dead horse.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
sf wrote: The point is that when you have 100,000s of records, this grep becomes really slow?
There are performance bugs with current versions of grep
and multibyte characters that are only getting addressed now.
To work around these do `export LANG=C` first.
In my experience grep is not scalable since it's O(n^2).
See below (note A and B are randomized versions of
/usr/share/dict/words (and therefore worst case for the
sort method)).
$ wc -l A B
45427 A
45427 B
$ export LANG=C
$ time grep -Fvf B A
real 0m0.437s
$ time sort A B B | uniq -u
real 0m0.262s
$ rpm -q grep coreutils
grep-2.5.1-16.1
coreutils-4.5.3-19
--
Pádraig Brady - http://www.pixelbeat.org
--
On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that P@draigBrady.co m
might have written:
sf: sf wrote: The point is that when you have 100,000s of records, this grep becomes really slow? There are performance bugs with current versions of grep and multibyte characters that are only getting addressed now. To work around these do `export LANG=C` first.
You also should use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.
In my experience grep is not scalable since it's O(n^2). See below (note A and B are randomized versions of /usr/share/dict/words (and therefore worst case for the sort method)).
$ wc -l A B 45427 A 45427 B
$ export LANG=C
$ time grep -Fvf B A real 0m0.437s
$ time sort A B B | uniq -u real 0m0.262s
$ rpm -q grep coreutils grep-2.5.1-16.1 coreutils-4.5.3-19
sf, you better do your own benchmarks (there is quick, sample code in
other posts of mine and Pádraig's) on your machine, since on my test
machine the numbers are reversed re to these of Pádraig's (grep takes
half the time).
package versions (on SuSE 9.1 64-bit):
$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21
language:
$ echo $LANG
en_US.UTF-8
Caution: both solutions are interexchangeab le as long as you don't have
duplicate lines in the A file. If you do, use the grep version.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually... This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: David Isaac |
last post by:
What's the standard replacement for the obsolete grep module?
Thanks,
Alan Isaac
|
by: Madhusudhanan Chandrasekaran |
last post by:
Hi:
This question is not directed "entirely" at python only. But since
I want to know how to do it in python, I am posting here.
I am constructing a huge matrix (m x n), whose columns n are stored in
smaller files. Once I read m such files, my matrix is complete. I want to
pass this matrix as an input to another script of mine (I just have the
binary.) Currently, the script reads a file (which is nothing but the
|
by: noro |
last post by:
Is there a more efficient method to find a string in a text file then:
f=file('somefile')
for line in f:
if 'string' in line:
print 'FOUND'
?
BTW:
|
by: js |
last post by:
Just my curiosity.
Can python beats perl at speed of grep-like processing?
$ wget http://www.gutenberg.org/files/7999/7999-h.zip
$ unzip 7999-h.zip
$ cd 7999-h
$ cat *.htm bigfile
$ du -h bigfile
du -h bigfile
|
by: tereglow |
last post by:
Hello all,
I come from a shell/perl background and have just to learn python. To
start with, I'm trying to obtain system information from a Linux
server using the /proc FS. For example, in order to obtain the amount
of physical memory on the server, I would do the following in shell:
grep ^MemTotal /proc/meminfo | awk '{print $2}'
That would get me the exact number that I need. Now, I'm trying to do
| |
by: Manogna |
last post by:
hi !
I have two big files like A anb B.
A contains nearly 1000 lines.
and B contains 1500000 lines.
|
by: Anton Slesarev |
last post by:
I've read great paper about generators:
http://www.dabeaz.com/generators/index.html
Author say that it's easy to write analog of common linux tools such
as awk,grep etc. He say that performance could be even better.
But I have some problem with writing performance grep analog.
It's my script:
|
by: Henning_Thornblad |
last post by:
What can be the cause of the large difference between re.search and
grep?
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
row+="a"
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it.
Here is my compilation command:
g++-12 -std=c++20 -Wnarrowing bit_field.cpp
Here is the code in...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
| |
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own....
Now, this would greatly impact the work of software developers. The idea...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules.
He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms.
Adolph will...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols.
I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
|
by: 6302768590 |
last post by:
Hai team
i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
| |