help make it faster please

I wrote this function, which does the following: after reading lines
from a file, it splits them and counts word occurrences using a hash
table. For some reason it is quite slow. Can someone help me make it
faster?

    import string

    f = open(filename)
    lines = f.readlines()

    def create_words(lines):
        cnt = 0
        spl_set = '[",;<>{}_&?! ():-[\.=+*\t\n\r]+'
        for content in lines:
            words = content.split()
            countDict = {}
            wordlist = []
            for w in words:
                w = string.lower(w)
                if w[-1] in spl_set: w = w[:-1]
                if w != '':
                    if countDict.has_key(w):
                        countDict[w] = countDict[w] + 1
                    else:
                        countDict[w] = 1
                wordlist = countDict.keys()
                wordlist.sort()
            cnt += 1
            if countDict != {}:
                for word in wordlist: print (word+' '+
                    str(countDict[word])+'\n')

Nov 10 '05 #1
Why rebuild wordlist and sort it after processing each word? It seems
that this can be done after the for loop.

Nov 10 '05 #2
Actually, I create a separate wordlist for each so-called line. By
"line" I mean what will be a paragraph in the future, so I will have to
recreate the wordlist on each pass of the loop.

Nov 10 '05 #3
Oh sorry, the indentation was messed up here. The lines

    wordlist = countDict.keys()
    wordlist.sort()

should be outside the word loop. The function is now:

    def create_words(lines):
        cnt = 0
        spl_set = '[",;<>{}_&?! ():-[\.=+*\t\n\r]+'
        for content in lines:
            words = content.split()
            countDict = {}
            wordlist = []
            for w in words:
                w = string.lower(w)
                if w[-1] in spl_set: w = w[:-1]
                if w != '':
                    if countDict.has_key(w):
                        countDict[w] = countDict[w] + 1
                    else:
                        countDict[w] = 1
            wordlist = countDict.keys()
            wordlist.sort()
            cnt += 1
            if countDict != {}:
                for word in wordlist: print (word+' '+
                    str(countDict[word])+'\n')

OK, now this is the correct question I am asking.

Nov 10 '05 #4
I don't know your intent, so I have no idea what it is for. However,
you are doing:

    wordlist = countDict.keys()
    wordlist.sort()

for every word processed, yet you don't use the contents of wordlist in
any way during the loop. Even if you need a fresh snapshot of countDict
after each word, I don't see the need for sorting. It seems like wasted
cycles to me.
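
For what it's worth, here is a minimal sketch of that point, written for Python 3 (the code in this thread targets Python 2). It builds the dictionary in the inner loop and sorts only once per line, after the loop. The name `count_line` and the sample sentence are made up for illustration.

```python
def count_line(line):
    """Count the words in one line, sorting only once at the end."""
    counts = {}
    for w in line.lower().split():
        counts[w] = counts.get(w, 0) + 1
    # Sort after the loop: one sort per line instead of one per word.
    return sorted(counts.items())

print(count_line("the cat and the hat"))
```

Sorting inside the word loop repeats the full sort once per word, so moving it out is likely the biggest single saving here.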

Nov 10 '05 #5
You're making a new countDict for each line read from the file... is
that what you meant to do? Or are you trying to count word occurrences
across the whole file?
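
To make the difference concrete, here is a small Python 3 sketch of both readings of the question. It uses `collections.Counter`, which did not exist when the original code was written; the function names and the sample data are assumptions for illustration only.

```python
from collections import Counter

def counts_per_line(lines):
    """One Counter per line, which is what the original function builds."""
    return [Counter(line.lower().split()) for line in lines]

def counts_whole_file(lines):
    """A single Counter accumulated across every line."""
    total = Counter()
    for line in lines:
        total.update(line.lower().split())
    return total

sample = ["a b a", "b c"]
print(counts_per_line(sample))
print(counts_whole_file(sample))
```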

--

In general, any time string manipulation is going slowly, ask yourself,
"Can I use the re module for this?"

    # disclaimer: untested code. probably contains typos

    import re
    word_finder = re.compile('[a-z0-9_]+', re.I)

    def count_words(string, word_finder=word_finder):  # avoid global lookups
        countDict = {}
        for match in word_finder.finditer(string):
            word = match.group(0)
            countDict[word] = countDict.get(word, 0) + 1
        return countDict

    f = open(filename)
    for i, line in enumerate(f.xreadlines()):
        countDict = count_words(line)
        print "Line %s" % i
        for word in sorted(countDict.keys()):
            print "  %s %s" % (word, countDict[word])

    f.close()

Nov 10 '05 #6
This can be faster; it avoids repeating the same work:

    from string import maketrans, ascii_lowercase, ascii_uppercase

    def create_words(afile):
        stripper = """'[",;<>{}_&?! ():[]\.=+-*\t\n\r^%0123456789/"""
        mapper = maketrans(stripper + ascii_uppercase,
                           " "*len(stripper) + ascii_lowercase)
        countDict = {}
        for line in afile:
            for w in line.translate(mapper).split():
                if w:
                    if w in countDict:
                        countDict[w] += 1
                    else:
                        countDict[w] = 1
        word_freq = countDict.items()
        word_freq.sort()
        for word, freq in word_freq:
            print word, freq

    create_words(file("test.txt"))

If you can load the whole file in memory then it can be made a little
faster...
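
A rough sketch of the whole-file variant, in Python 3 rather than the Python 2 used above: translate and split the entire text in one call instead of line by line. The separator set, the names `SEPARATORS`, `TABLE` and `count_words_in_text`, and the sample text are assumptions for illustration, not the exact characters from the code above.

```python
import string
from collections import Counter

# Characters treated as separators; this set is an illustrative assumption.
SEPARATORS = string.punctuation + string.digits + string.whitespace
# Map every separator to a space and every uppercase letter to lowercase.
TABLE = str.maketrans(
    SEPARATORS + string.ascii_uppercase,
    " " * len(SEPARATORS) + string.ascii_lowercase,
)

def count_words_in_text(text):
    """Translate and split the whole text in one pass, then count."""
    return Counter(text.translate(TABLE).split())

text = "Hello, world! Hello again.\nWorld 42"
for word, freq in sorted(count_words_in_text(text).items()):
    print(word, freq)
```

With a real file you would pass in the result of `open("test.txt").read()`, which only makes sense when the file fits comfortably in memory.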

Bear hugs,
bearophile

Nov 10 '05 #7
The way this is written, you create a new countDict object
for every line of the file. It's not clear that this is
what you meant to do.

Also, you are sorting wordlist for every line, not just once
for the entire file, because the sort is inside the loop that
processes lines.

There is some extra work in testing for an empty dictionary:

    wordlist = countDict.keys()

then

    if countDict != {}:
        for word in wordlist:

If countDict is empty then wordlist will be empty too, so testing
for it is unnecessary.

You are incrementing cnt but never using it.

I don't think spl_set will do what you want, but I haven't modified
it. To split on all those characters you are going to need to
use regular expressions, not split.
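
To illustrate the regular expression route, here is a hedged Python 3 sketch. `str.split()` with no arguments only splits on whitespace, and `w[-1] in spl_set` is a plain character-membership test that removes at most one trailing character. The pattern below is an assumed approximation of the characters listed in spl_set, and `SEPARATOR_RE` and `split_words` are made-up names.

```python
import re

# Assumed separator pattern: one or more punctuation or whitespace characters.
# The hyphen is placed last inside [...] so that it is taken literally.
SEPARATOR_RE = re.compile(r'[",;<>{}_&?!():\[\].=+*\s-]+')

def split_words(line):
    """Split on any run of separator characters and drop empty pieces."""
    return [w for w in SEPARATOR_RE.split(line.lower()) if w]

print("(hello)".split())  # str.split leaves the parentheses attached
print(split_words("(Hello), world; foo=bar"))
```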

Modified code:

    def create_words(lines):
        spl_set = '[",;<>{}_&?! ():-[\.=+*\t\n\r]+'
        countDict = {}
        for content in lines:
            words = content.split()
            for w in words:
                w = w.lower()
                if w[-1] in spl_set: w = w[:-1]
                if w:
                    if countDict.has_key(w):
                        countDict[w] = countDict[w] + 1
                    else:
                        countDict[w] = 1

        return countDict

    import time
    filename = r'C:\cygwin\usr\share\vim\vim63\doc\version5.txt'
    f = open(filename)
    lines = f.readlines()
    start_time = time.time()
    countDict = create_words(lines)
    stop_time = time.time()
    elapsed_time = stop_time - start_time
    wordlist = countDict.keys()
    wordlist.sort()
    for word in wordlist:
        print "word=%s count=%i" % (word, countDict[word])

    print "Elapsed time in create_words function=%.2f seconds" % elapsed_time

I ran this against a 551K text file and it runs in 0.11 seconds
on my machine (3.0 GHz P4).
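
If you want to repeat the measurement on a current interpreter, a small Python 3 helper along these lines should work. It uses `time.perf_counter()`, which is better suited to short intervals than `time.time()`; the name `time_call` and the `sorted` call used to exercise it are placeholders, not part of the code above.

```python
import time

def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Placeholder workload; substitute create_words(lines) in practice.
result, elapsed = time_call(sorted, ["b", "a", "c"])
print("result=%s elapsed=%.6f seconds" % (result, elapsed))
```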

Larry Bates
Nov 10 '05 #8
OK, this sounds much better. Could you tell me what to do if I want to
keep characters like @ in words? I would like to treat @ as part of the
word.

Nov 10 '05 #9
The word_finder regular expression defines what will be considered a
word.

"[a-z0-9_]" means "match a single character from the set {a through z,
0 through 9, underscore}".
The + means "match as many as you can, minimum of one"

To match @ as well, add it to the set of characters to match:

word_finder = re.compile('[a-z0-9_@]+', re.I)

The re.I flag makes the expression case insensitive.
See the documentation for re for more information.
Also, it looks like I forgot to lowercase matched words. The line
    word = match.group(0)
should read:
    word = match.group(0).lower()
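
Putting the two changes together, a Python 3 sketch of the counting function could look like this. The sample sentence is made up; note that `.` is not in the character set, so an address is split at the dot.

```python
import re

# Same character set as before, with @ added.
word_finder = re.compile(r'[a-z0-9_@]+', re.I)

def count_words(text):
    """Count case-insensitive words, treating @ as a word character."""
    counts = {}
    for match in word_finder.finditer(text):
        word = match.group(0).lower()
        counts[word] = counts.get(word, 0) + 1
    return counts

print(count_words("Mail Bob@Example.com or bob@example.com"))
```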

Nov 10 '05 #10
