
help make it faster please

I wrote this function which does the following:
after reading lines from a file, it splits them and finds the word occurrences
through a hash table... for some reason this is quite slow... can someone
help me make it faster...
f = open(filename)
lines = f.readlines()

def create_words(lines):
    cnt = 0
    spl_set = '[",;<>{}_&?! ():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
        wordlist = []
        for w in words:
            w = string.lower(w)
            if w[-1] in spl_set: w = w[:-1]
            if w != '':
                if countDict.has_key(w):
                    countDict[w] = countDict[w] + 1
                else:
                    countDict[w] = 1
                wordlist = countDict.keys()
                wordlist.sort()
        cnt += 1
        if countDict != {}:
            for word in wordlist:
                print (word + ' ' + str(countDict[word]) + '\n')

Nov 10 '05
Lonnie Princehouse wrote:
"[a-z0-9_]" means "match a single character from the set {a through z,
0 through 9, underscore}".


"\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you
specify otherwise using the locale or unicode flags), but is handled more
efficiently by the RE engine.

(you can use \w in sets too, so you can do e.g. "[\w@]")

</F>
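As a rough sketch of this advice (not the OP's exact code; names and sample text invented here), a word counter built on a single `[\w@]`-style pattern looks like:

```python
import re

# Compile once; [\w@]+ matches runs of letters, digits, underscore and '@'.
word_finder = re.compile(r'[\w@]+')

def count_words(text):
    """Count case-folded word occurrences in one pass."""
    counts = {}
    for word in word_finder.findall(text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words('The cat and the dog and the cat'))
```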

Nov 10 '05 #11
<pk******@gmail.com> wrote:
Oh sorry, the indentation was messed up here... the
    wordlist = countDict.keys()
    wordlist.sort()
should be outside the word loop... now:
def create_words(lines):
    cnt = 0
    spl_set = '[",;<>{}_&?! ():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
        wordlist = []
        for w in words:
            w = string.lower(w)
            if w[-1] in spl_set: w = w[:-1]
            if w != '':
                if countDict.has_key(w):
                    countDict[w] = countDict[w] + 1
                else:
                    countDict[w] = 1
        wordlist = countDict.keys()
        wordlist.sort()
        cnt += 1
        if countDict != {}:
            for word in wordlist:
                print (word + ' ' + str(countDict[word]) + '\n')

OK, now this is the correct code I am asking about...


(a) You might be better off doing:

    words = content.lower().split()
    for w in words:
        ...

instead of calling lower() on each separate word (and note that most
functions from the string module are deprecated in favour of string methods).

(b) spl_set isn't doing what you might think it is -- it looks like
you've written it as a regexp but you're using it as a character set.
What you might want is:

    spl_set = '",;<>{}_&?! ():-[\.=+*\t\n\r'

and

    while w[-1] in spl_set: w = w[:-1]

That loop can be written:

    w = w.rstrip(spl_set)

(which by my timings is faster if you have multiple characters from
spl_set at the end of your word, but slower if you have 0 or 1).

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Nov 11 '05 #12
On 10 Nov 2005 10:43:04 -0800, be************@lycos.com wrote:

This can be faster, it avoids doing the same things more times:

from string import maketrans, ascii_lowercase, ascii_uppercase

def create_words(afile):
    stripper = """'[",;<>{}_&?! ():[]\.=+-*\t\n\r^%0123456789/"""
    mapper = maketrans(stripper + ascii_uppercase,
                      " "*len(stripper) + ascii_lowercase)

good way to prepare for split

    countDict = {}
    for line in afile:
        for w in line.translate(mapper).split():
            if w:

I suspect it's not possible to get '' in the list from somestring.split()

                if w in countDict:
                    countDict[w] += 1
                else:
                    countDict[w] = 1

does that beat the try and get versions? I.e., (untested)

    try: countDict[w] += 1
    except KeyError: countDict[w] = 1
or
    countDict[w] = countDict.get(w, 0) + 1

    word_freq = countDict.items()
    word_freq.sort()
    for word, freq in word_freq:
        print word, freq

create_words(file("test.txt"))

If you can load the whole file in memory then it can be made a little
faster...

Bear hugs,
bearophile


Regards,
Bengt Richter
Nov 22 '05 #13
Bengt Richter enlightened us with:
I suspect it's not possible to get '' in the list from
somestring.split()

Time to adjust your suspicions:

>>> ';abc;'.split(';')
['', 'abc', '']

countDict[w] += 1
else:
    countDict[w] = 1

does that beat the try and get versions? I.e., (untested)
try: countDict[w] += 1
except KeyError: countDict[w] = 1


The OP's way is faster. A 'try' block is very fast, but throwing & handling
an exception is slow.

countDict[w] = countDict.get(w, 0) + 1


I think this has the potential of being faster than the OP's method,
because it's likely to be fully implemented in C.

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Nov 22 '05 #14
Thank you Bengt Richter and Sybren Stuvel for your comments, my little
procedure can be improved a bit in many ways, it was just a quickly
written first version (but it can be enough for basic usage).

Bengt Richter:
good way to prepare for split
Maybe there is a better way: putting in just the accepted letters and
accented letters, instead of the not-accepted symbols.

I suspect it's not possible to get '' in the list from somestring.split()

You are probably right, the algorithm used is different if you don't
give a splitting string:

>>> ' abc a '.split()
['abc', 'a']
>>> '.abc..a.'.split(".")
['', 'abc', '', 'a', '']

does that beat the try and get versions? I.e., (untested)

Yes.

countDict[w] = countDict.get(w, 0) + 1


I think the if-else version is the fastest, I have tested it a long
time ago... You can easily do a speed test to see if I am wrong.
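As a sketch of such a speed test (sample data invented here), timeit can compare the three counting idioms discussed in this thread directly:

```python
import timeit

# Made-up sample: 'the' occurs twice per repetition.
words = ('the quick brown fox jumps over the lazy dog ' * 200).split()

def count_if(ws):
    # if/else membership test
    d = {}
    for w in ws:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1
    return d

def count_try(ws):
    # try/except KeyError
    d = {}
    for w in ws:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return d

def count_get(ws):
    # dict.get with a default
    d = {}
    for w in ws:
        d[w] = d.get(w, 0) + 1
    return d

for f in (count_if, count_try, count_get):
    t = timeit.timeit(lambda: f(words), number=100)
    print('%s: %.4f' % (f.__name__, t))
```

The relative ranking depends on the data: try/except wins when keys are almost always present, the if/else and get() forms when many keys are new.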

Bear hugs,
bearophile

Nov 22 '05 #15
On Sat, 12 Nov 2005 10:46:53 +0100, Sybren Stuvel <sy*******@YOURthirdtower.com.imagination> wrote:
Bengt Richter enlightened us with:
I suspect it's not possible to get '' in the list from
somestring.split()

Time to adjust your suspicions:

>>> ';abc;'.split(';')
['', 'abc', '']

I know about that one ;-)
I meant somestring.split() just like that -- without a splitter argument.
My suspicion remains ;-)

countDict[w] += 1
else:
    countDict[w] = 1

does that beat the try and get versions? I.e., (untested)
try: countDict[w] += 1
except KeyError: countDict[w] = 1


The OP's way is faster. A 'try' block is very fast, but throwing & handling
an exception is slow.

Yes, so it's data sensitive. Same with the other one not mentioned:
countDict.setdefault(w, [0])[0] += 1  # accum count in length one list vs as int
                                      # too much subscripting work though

countDict[w] = countDict.get(w, 0) + 1


I think this has the potential of being faster than the OP's method,
because it's likely to be fully implemented in C.

Not sure what you mean. Don't forget, all those expressions are evaluated dynamically
step by step with byte codes being switched on in C, but still a bunch of steps. I.e.,
>>> import dis
>>> dis.dis(compile('countDict[w] = countDict.get(w, 0) + 1', '', 'exec'))
  1           0 LOAD_NAME                0 (countDict)
              3 LOAD_ATTR                1 (get)
              6 LOAD_NAME                2 (w)
              9 LOAD_CONST               0 (0)
             12 CALL_FUNCTION            2
             15 LOAD_CONST               1 (1)
             18 BINARY_ADD
             19 LOAD_NAME                0 (countDict)
             22 LOAD_NAME                2 (w)
             25 STORE_SUBSCR
             26 LOAD_CONST               2 (None)
             29 RETURN_VALUE

Looks like if you succeed most of the time without a KeyError, the try/except might be
a tad faster, depending on what the exception block setup/teardown costs.
dis.dis(compile ("""\
... try: countDict[w] += 1
... except KeyError: countDict[w] = 1
... """, '', 'exec'))
  1           0 SETUP_EXCEPT            20 (to 23)
              3 LOAD_NAME                0 (countDict)
              6 LOAD_NAME                1 (w)
              9 DUP_TOPX                 2
             12 BINARY_SUBSCR
             13 LOAD_CONST               0 (1)
             16 INPLACE_ADD
             17 ROT_THREE
             18 STORE_SUBSCR
             19 POP_BLOCK
             20 JUMP_FORWARD            29 (to 52)

  2     >>   23 DUP_TOP
             24 LOAD_NAME                2 (KeyError)
             27 COMPARE_OP              10 (exception match)
             30 JUMP_IF_FALSE           17 (to 50)
             33 POP_TOP
             34 POP_TOP
             35 POP_TOP
             36 POP_TOP
             37 LOAD_CONST               0 (1)
             40 LOAD_NAME                0 (countDict)
             43 LOAD_NAME                1 (w)
             46 STORE_SUBSCR
             47 JUMP_FORWARD             2 (to 52)
        >>   50 POP_TOP
             51 END_FINALLY
        >>   52 LOAD_CONST               1 (None)
             55 RETURN_VALUE

Regards,
Bengt Richter
Nov 22 '05 #16
Bengt Richter enlightened us with:
I meant somestring.split() just like that -- without a splitter
argument. My suspicion remains ;-)


Mine too ;-)

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Nov 22 '05 #17


Fredrik Lundh wrote:
Lonnie Princehouse wrote:

"[a-z0-9_]" means "match a single character from the set {a through z,
0 through 9, underscore}".

"\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you
specify otherwise using the locale or unicode flags), but is handled more
efficiently by the RE engine.

(you can use \w in sets too, so you can do e.g. "[\w@]")

</F>


The \w does make a small difference, but not as much as I expected.
Surprisingly a standard Python word iterator works just as well, and is
easier to understand than the re version.

Which one is faster depends on the average word length and number of
ignored characters.

Cheers,
Ron


Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.06820057 (best of 3)
count_words: 0.07333837 (best of 3)
#########

import string
import re
import time
import random
# Create a really ugly n length string to test with.
n = 100000
random.seed(1)
lines = ''.join([ random.choice(string.ascii_letters * 2
                  + '_@$&*()#/<>' + ' \n' * 6) for x in range(n) ])
print 'Character count:', n
print 'Word count:', len(lines.split())
print 'Average word size:', float(n)/len(lines.split())

letters = string.lowercase + string.digits + '_@'

def word_iter(text, letters=letters):
    ltrs = letters.lower()
    wd = ''
    for c in text + ' ':
        if c in ltrs:
            wd += c
        elif wd != '':
            yield wd
            wd = ''

def word_counter(text):
    txt = text.lower()
    countDict = {}
    for wd in word_iter(txt):
        if wd in countDict:
            countDict[wd] += 1
        else:
            countDict[wd] = 1
    return countDict


word_finder = re.compile('[\w@]+', re.I)

def count_words(string, word_finder=word_finder):
    # avoid global lookups
    countDict = {}
    for match in word_finder.finditer(string.lower()):
        word = match.group(0)
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

foos = [word_counter, count_words]
r1 = r2 = None
for foo in foos:
    best_time = 0
    for n in range(3):
        t = time.clock()
        for line in lines.splitlines():
            countDict = foo(line)
        tt = time.clock() - t
        if tt > best_time: best_time = tt

    r1 = r2
    r2 = countDict
    if r1 != None:
        # change to 1 if assert fails to find problem
        if 0:
            for k in r1.keys():
                if r1[k] != r2[k]:
                    print k, r1[k], r2[k]
        assert r1 == r2

    print '%s: %.8f (best of %d)' \
          % (foo.__name__, best_time, n+1)

Nov 22 '05 #18
Ron Adam wrote:
The \w does make a small difference, but not as much as I expected.
that's probably because your benchmark has a lot of dubious overhead:
word_finder = re.compile('[\w@]+', re.I)
no need to force case-insensitive search here; \w looks for both lower-
and uppercase characters.
for match in word_finder.finditer(string.lower()):
since you're using a case-insensitive RE, that lower() call is not necessary.
word = match.group(0)
and findall() is of course faster than finditer() + m.group().
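A minimal sketch of that difference (pattern and sample text invented here): findall() hands back the matched strings directly, so the per-match object creation, attribute lookup, and method call all disappear:

```python
import re

text = 'one two two three three three'
pat = re.compile(r'[\w@]+')

# finditer: one match object per word, then .group(0) on each
via_finditer = [m.group(0) for m in pat.finditer(text)]

# findall: the matched strings come back directly
via_findall = pat.findall(text)

print(via_finditer == via_findall)  # -> True
```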
t = time.clock()
for line in lines.splitlines():
countDict = foo(line)
tt = time.clock()-t


and if you want performance, why are you creating a new dictionary for
each line in the sample?

here's a more optimized RE word finder:

word_finder_2 = re.compile('[\w@]+').findall

def count_words_2(string, word_finder=word_finder_2):
    # avoid global lookups
    countDict = {}
    for word in word_finder(string):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

with your original test on a slow machine, I get

count_words: 0.29868684 (best of 3)
count_words_2: 0.17244873 (best of 3)

if I call the function once, on the entire sample string, I get

count_words: 0.23096036 (best of 3)
count_words_2: 0.11690620 (best of 3)

</F>

Nov 22 '05 #19


Fredrik Lundh wrote:
Ron Adam wrote:

The \w does make a small difference, but not as much as I expected.

that's probably because your benchmark has a lot of dubious overhead:


I think it does what the OP described, but that may not be what he
really needs.

Although the test to find the best of n was instead finding the worst
of n, which explains why I was getting a larger variance than I thought
I should have been.

word_finder = re.compile('[\w@]+', re.I)

no need to force case-insensitive search here; \w looks for both lower-
and uppercase characters.


But the dictionary keys need to be either upper or lower otherwise you
count 'The' separately from 'the'.

for match in word_finder.finditer(string.lower()):

since you're using a case-insensitive RE, that lower() call is not necessary.

word = match.group(0)

and findall() is of course faster than finditer() + m.group().


Cool, I don't use re that often so I just used what was posted to test
against.
t = time.clock()
for line in lines.splitlines():
countDict = foo(line)
tt = time.clock()-t

and if you want performance, why are you creating a new dictionary for
each line in the sample?


Because that's what the OP apparently wanted: a line-by-line word
count. I originally did it to get an overall count and then changed
it so it matched the re version that was posted.
here's a more optimized RE word finder:

word_finder_2 = re.compile('[\w@]+').findall

def count_words_2(string, word_finder=word_finder_2):
    # avoid global lookups
    countDict = {}
    for word in word_finder(string):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

with your original test on a slow machine, I get

count_words: 0.29868684 (best of 3)
count_words_2: 0.17244873 (best of 3)

if I call the function once, on the entire sample string, I get

count_words: 0.23096036 (best of 3)
count_words_2: 0.11690620 (best of 3)

</F>


Wow, a lot bigger difference than on my machine. <curious> An Athlon
64 3000+ on WinXP. I'm not sure how much difference that would make?

This is what I get after adding the above version to it, with the
lower(). There's not quite as big a difference as you get, but the
findall version is still faster than both the others.

Cheers,
Ron
Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.06245989 (best of 3)
count_words: 0.07309812 (best of 3)
count_words_2: 0.04981024 (best of 3)
And counting all the words at once...

Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.05325006 (best of 3)
count_words: 0.05910528 (best of 3)
count_words_2: 0.03748158 (best of 3)
They all improve, but the re findall version is clearly better.

#####################

import string
import re
import time
import random

# Create a really ugly n length string to test with.
# The word length are
n = 100000
random.seed(1)
lines = ''.join([ random.choice(string.ascii_letters * 2
                  + '_@$&*()#/<>' + ' \n' * 6) for x in range(n) ])
print 'Character count:', n
print 'Word count:', len(lines.split())
print 'Average word size:', float(n)/len(lines.split())
letters = string.lowercase + string.digits + '_@'

def word_iter(text, letters=letters):
    wd = ''
    for c in text + ' ':
        if c in letters:
            wd += c
        elif wd != '':
            yield wd
            wd = ''

def word_counter(text):
    countDict = {}
    for wd in word_iter(text.lower()):
        if wd in countDict:
            countDict[wd] += 1
        else:
            countDict[wd] = 1
    return countDict


word_finder = re.compile('[\w@]+', re.I).finditer

def count_words(string, word_finder=word_finder):
    # avoid global lookups
    countDict = {}
    for match in word_finder(string.lower()):
        word = match.group(0)
        countDict[word] = countDict.get(word, 0) + 1
    return countDict


word_finder_2 = re.compile('[\w@]+').findall

def count_words_2(string, word_finder=word_finder_2):
    # avoid global lookups
    countDict = {}
    for word in word_finder(string.lower()):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict


foos = [word_counter, count_words, count_words_2]
r1 = r2 = None
for foo in foos:
    best_time = 1000000  # too large to be useful on purpose
    for n in range(3):
        t = time.clock()
        #for line in lines.splitlines():
        countDict = foo(lines)
        tt = time.clock() - t
        best_time = min(tt, best_time)

    r1 = r2
    r2 = countDict
    if r1 != None:
        # change to 1 if assert fails to find problem
        if 0:
            for k in r1.keys():
                if r1[k] != r2[k]:
                    print k, r1[k], r2[k]
        assert r1 == r2

    print '%s: %.8f (best of %d)' \
          % (foo.__name__, best_time, n+1)

Nov 22 '05 #20

