
large dictionary creation takes a LOT of time.

this code here:
def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of don quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines()

Jul 19 '05 #1
"possibilitybox " <po************ @gmail.com> writes:
this code here:
def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of don quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is ok and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot? You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements. Also you should use xrange instead of range,
to avoid allocating a big list in memory:

def wordcount(lines):
    frequency = {}
    for i in xrange(len(lines)/8):
        for word in lines[i].split():
            frequency[word] = 1 + frequency.get(word, 0)
    return frequency

wordcount(lines)
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.


It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
Jul 19 '05 #2
possibilitybox wrote:
this code here:
def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of don quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines()


Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
running on?

path = 'DonQuixote.txt'

frequency = {}

for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
Kent
Jul 19 '05 #3
>>>>> "Kent" == Kent Johnson <ke****@tds.net > writes:

Kent> if frequency.has_k ey(word):
Kent> frequency[word] += 1
Kent> else:
Kent> frequency[word] = 1

This is a good place to use 'get' method of dict:

frequency[word] = frequency.get(w ord,0) + 1

--
Ville Vainio http://tinyurl.com/2prnb
Jul 19 '05 #4
Ville Vainio:
This is a good place to use the 'get' method of dict:
frequency[word] = frequency.get(word, 0) + 1


I think Kent Johnson's version is longer, but a bit faster...
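One quick, unscientific way to check that claim is the timeit module; a
minimal sketch, where the word list is a made-up stand-in for real text:

import timeit

setup = "words = ['foo', 'bar', 'baz', 'foo'] * 1000"

version_has_key = """
frequency = {}
for word in words:
    if frequency.has_key(word):
        frequency[word] += 1
    else:
        frequency[word] = 1
"""

version_get = """
frequency = {}
for word in words:
    frequency[word] = frequency.get(word, 0) + 1
"""

# time 100 passes of each version over the same 4000-word list
for name, stmt in (('has_key', version_has_key), ('get', version_get)):
    print name, timeit.Timer(stmt, setup).timeit(number=100)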

Bye,
Bearophile

Jul 19 '05 #5
In article <11**********************@z14g2000cwz.googlegroups.com>,
"possibilitybox" <po************@gmail.com> wrote:
this code here:
def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of don quixote).
Something doesn't make sense with your timing.

I just downloaded the text of Don Quixote
(http://www.gutenberg.org/dirs/9/9/996/996.txt). It's about 2.3 Mbytes,
428 kwords, 40 klines. This includes the text of the novel itself, plus a
little boilerplate text added by the Gutenberg Project folks.

I ran your program against it on my PowerBook (1 GHz PowerPC, Python 2.3.4,
768 Mbytes RAM). It took about 0.4 seconds. When I got rid of the "/8" in
the range() call (so it processed the whole text), it took about 1.8
seconds (0.24 of which were just reading the file). Some other posters
reported similar findings.

What kind of system are you running it on? The only thing I can think of
is that you've got way too little memory and your machine is just
thrashing. My Python process is about 31 Mb when it first starts up, grows
to 35 when the file is read into a list, then gets to 38 after the call to
wordcount().
i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.
Well, I don't see anything in your code which isn't linear. There are some
places where you could do things a little more efficiently, but these would
all be replacing one linear process with a better linear process. Small
speedups, but not the drastic kind of speedups you would expect by
replacing a quadratic process with a linear one.

Here's a few things I would change:
if not locals().has_key("frequency"):
    frequency = {}
This is kind of silly. Just factor this out of the main loop, and do
"frequency = {}" before the first for loop. This won't speed things up
other than trivially, but it'll make your code easier to read and
understand.

Next, replace
for i in range(len(lines)):
    words = lines[i].split(" ")

with something like

for line in lines:
    words = line.split()
It's marginally faster, easier to read, and is actually more correct;
calling split() with no arguments makes it split on arbitrary white space,
which is probably what you really want:
"foo bar baz".split(" ") ['foo', 'bar', '', '', '', 'baz']
"foo bar baz".split()

['foo', 'bar', 'baz']

And lastly, instead of

if frequency.has_key(word):
    frequency[word] += 1
else:
    frequency[word] = 1


I would do:

try:
    frequency[word] += 1
except KeyError:
    frequency[word] = 1

which is usually a little bit faster. Somebody else mentioned using the
get() method of dictionaries here; that might be even better, but I learned
the try/except trick before get() existed, so I tend to stick to that :-)
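For completeness, newer Pythons (2.5 and up) add collections.defaultdict,
which folds the missing-key case into the dictionary itself. A minimal
sketch, reusing the DonQuixote.txt filename from an earlier post:

from collections import defaultdict

frequency = defaultdict(int)   # missing keys start at 0

for line in open('DonQuixote.txt'):
    for word in line.split():
        frequency[word] += 1

print len(frequency), 'words'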

But, as I said, all of these are just minor tweaks. Overall, your code
looks like it should run in O(n), so fixing your code is not where you
should be looking. Something external (i.e. memory thrashing or some other
exceptionally bad bit of system performance) has to be causing the
horrendously bad performance you're seeing, and that's where you should be
concentrating your efforts.

BTW, if you suspected you had some kind of non-linear algorithm, and didn't
trust code inspection to verify its existence, you could just run your
program a bunch of times on different sized data sets, and plot runtime vs.
input size to see what kind of curve you get.
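A minimal sketch of that experiment, assuming the Gutenberg file is saved
as DonQuixote.txt and a wordcount() that processes every line it is given
(unlike the original, which slices off 1/8th internally):

import time

lines = open('DonQuixote.txt').readlines()

for fraction in (8, 4, 2, 1):
    subset = lines[:len(lines) / fraction]   # input doubles each pass
    start = time.time()
    wordcount(subset)
    print len(subset), 'lines:', time.time() - start, 'seconds'

If the function is linear, the times should roughly double from row to
row; a quadratic process would roughly quadruple them.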
Jul 19 '05 #6
Kent Johnson wrote:

Here is a little cleaner version. It takes about a second to run on my
PC. What hardware are you running on?

path = 'DonQuixote.txt'

frequency = {}

for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
Kent

for line in open(path):

The line of your example raises another question: will the opened file be
read all at once, as the readlines() method does, or line by line, as the
readline() method does?
As far as I know, it depends on the implementation of the "__iter__"
method of the object that open() returns. So another question: where can I
find such information (about how these functions work)?
--
Best regards,
Maksim Kasimov
mailto: ka*****@i.com.ua
Jul 19 '05 #7
Maksim Kasimov wrote:
Kent Johnson wrote:
> for line in open(path):

The line of your example raises another question: will the opened file be
read all at once, as readlines() does, or line by line, as readline() does?


It will be read line by line as readline() does.
As far as I know, it depends on the implementation of the "__iter__"
method of the object that open() returns. So another question: where can I
find such information (about how these functions work)?


http://docs.python.org/lib/built-in-funcs.html
http://docs.python.org/lib/bltin-file-objects.html
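A quick interactive check (filename assumed, as elsewhere in the thread):

>>> f = open('DonQuixote.txt')
>>> iter(f) is f   # the file object is its own iterator
True

So "for line in f" just calls f.next() repeatedly, reading one buffered
line at a time rather than slurping the whole file.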

Kent
Jul 19 '05 #8
In fact, as one of the Peters (either Otten or Hansen) explained to me,

for line in open(file):

is actually both faster (being buffered) and generally better for very
large files because it doesn't read the whole file into memory, like
readlines does (if you have a memory limitation).
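A side-by-side sketch of the two styles (filename assumed, as elsewhere in
the thread):

path = 'DonQuixote.txt'

# Eager: readlines() builds the entire list of lines in memory at once.
lines = open(path).readlines()

# Lazy: the file object yields buffered lines one at a time, so memory
# use stays flat however large the file is.
for line in open(path):
    pass   # process each line here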

Jul 19 '05 #9
On Friday 29 April 2005 11:53, Ville Vainio wrote:
>> "Kent" == Kent Johnson <ke****@tds.net > writes:


Kent> if frequency.has_k ey(word):
Kent> frequency[word] += 1
Kent> else:
Kent> frequency[word] = 1

This is a good place to use 'get' method of dict:

frequency[word] = frequency.get(w ord,0) + 1


try/except might be fastest of all:

http://gumuz.looze.net/wordpress/ind...-optimisation/


Jul 19 '05 #10
