Optimizing a text statistics function

Nickolay Kolev

Hi all,

I am currently writing some simple functions in the process of learning
Python. I have a task where the program has to read in a text file and
display some statistics about the tokens in that file.

The text I have been feeding it is Dickens' David Copperfield.

It is really simple - it reads the file in memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of time it appears in the text.

The code (~30 lines and easy to read :-) can be found at
http://www.uni-bonn.de/~nmkolev/python/textStats.py

I am now looking for a way to make the whole thing run faster. I have
already made many changes since the initial version, realising many
mistakes. As I do not think of anything else, I thought I would ask the
more knowledgeable.

I find the two loops through the initial list a bit troubling. Could
this be avoided?

Any other remarks and suggestions will also be greatly appreciated.

Many thanks in advance,
Nicky

Jul 18 '05 #1

Subscribe Post Reply

1850

Dennis Lee Bieber

On Wed, 21 Apr 2004 16:51:56 +0200, Nickolay Kolev <nm*****@uni-bonn.de>
declaimed the following in comp.lang.python:

It is really simple - it reads the file in memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of time it appears in the text.
Without looking at the code, I'd probably drop the use of the
tuples for a dictionary keyed by the word, with the data being the
count. Granted, the output would not be easily sorted...

Okay, you do have a dictionary...

I suggest dropping the initialization pass -- for common words
it's going to be redundant... Dictionaries have a method for supplying a
default value if a key doesn't exist. See the following:
## for x in strippedWords:
## unique[x] = 0
##
## for x in strippedWords:
## unique[x] += 1

for x in strippedWords:
unique[x] = unique.get(x, 0) + 1
-- ================================================== ============ <
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
wu******@dm.net | Bestiaria Support Staff <
================================================== ============ <
Home Page: <http://www.dm.net/~wulfraed/> <
Overflow Page: <http://wlfraed.home.netcom.com/> <

Jul 18 '05 #2

Jack Diederich

On Wed, Apr 21, 2004 at 04:51:56PM +0200, Nickolay Kolev wrote:

It is really simple - it reads the file in memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of time it appears in the text.

The code (~30 lines and easy to read :-) can be found at
http://www.uni-bonn.de/~nmkolev/python/textStats.py

I am now looking for a way to make the whole thing run faster.

Do you actually need it to be faster? If the answer is "no, but
it would be nice." then you are already done *wink*.

A good profiling strategy is to wrap each part in a function
so you can see which lines consume the most CPU. Just make sure
to wrap big pieces so the function call overhead doesn't get
added ten thousand times and distort the picture.

You will get a bunch of suggestions on how to make the code faster,
so I'll skip those. What you want to do is only do the expensive
parsing once and not every time you run your program. Try pickle.
[untested code follows]

import pickle

def proc(txt_filename): # txt_filename like 'dickens.txt'
... exiting code ...
reutrn wordCountList

def procWrap(txt_filename):
cache_filename = tst_filename.replace('.txt', '.pickle')
try:
fob = open(cache_filename)
wordCountList = pickle.load(fob)
except IOError:
wordCountList = proc(txt_filename)
fob = open(cache_filename, 'w+')
pickle.dump(wordCountList, fob, -1) # see the docs about the '-1'
return wordCountList

Use procWrap() instead of proc() to get the list. You'll need
to delete the .pickle file every time you change proc() so the
pickle gets refreshed. This way you never have to care about how
effecient the parsing loop is, because you only have to call it once.

-jackdied

Jul 18 '05 #3

Nickolay Kolev

Dennis Lee Bieber wrote:

## for x in strippedWords:
## unique[x] = 0
##
## for x in strippedWords:
## unique[x] += 1

for x in strippedWords:
unique[x] = unique.get(x, 0) + 1

Just the thing I was looking for. :-)

To clarify:

unique.get(x, 0)

This says *give me the value of unique[x] if it exists, else give me 0*.
Correct?

Jul 18 '05 #4

Neil Benn

Hello,

I don't know python as well as most people on this list so
this is a leading question.

In other languages I've used (mainly java although some C, C# VB
<wash out your mouth>), the way I woud look at speeding this up is to
avoid loading all the words into memory in one go and then working upon
them. I'd create one stream which reads through the file, then passes
onto a listener each word it finds from the lexing (making the input
into tokens) and then another stream listening to this which will then
sort out the detail from these tokens (parsing), finally an output
stream which put this data wherever it needs to be (DB, screen, file,
etc). This means that the program would scale better (if you pass the
European voting register through your system it would take exponentially
much longer as you must scan the information twice).

However as more experienced python programmers have not suggested
this is this because there is :

a. Something I'm not getting about python text handling
b. Not easy/possible in python

I ask this because I've been looking at (C)StringIO and it is OK for
this kind of behaviour using strings (reading from a serial port and
passing onto middleware) but it doesn't seem to have the ability to
remove characters from the buffer once they have been read, therefore
the buffer will grow and grow as the process runs. For now I'm having
to use strings which is less than ideal because they are immutable (I
was think of writing my own StringBuffer class will will discard
characters once they have been read from the StringBuffer) and therefore
my program scales badly.

However I do agree with the earlier poster - don't optimise for
speed unless you need to (I'm assuming this is an academic exercise and
I'm waiting to go to the pub)!!! Simplicity of design is usually a
better win.

Cheers,

Neil

Dennis Lee Bieber wrote:

On Wed, 21 Apr 2004 16:51:56 +0200, Nickolay Kolev <nm*****@uni-bonn.de>
declaimed the following in comp.lang.python:

It is really simple - it reads the file in memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of time it appears in the text.

Without looking at the code, I'd probably drop the use of the
tuples for a dictionary keyed by the word, with the data being the
count. Granted, the output would not be easily sorted...

Okay, you do have a dictionary...

I suggest dropping the initialization pass -- for common words
it's going to be redundant... Dictionaries have a method for supplying a
default value if a key doesn't exist. See the following:
## for x in strippedWords:
## unique[x] = 0
##
## for x in strippedWords:
## unique[x] += 1

for x in strippedWords:
unique[x] = unique.get(x, 0) + 1
--
================================================== ============ <
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
wu******@dm.net | Bestiaria Support Staff <
================================================== ============ <
Home Page: <http://www.dm.net/~wulfraed/> <
Overflow Page: <http://wlfraed.home.netcom.com/> <

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
PfotenhauerStrasse 108
D-01307
Dresden
Germany

Tel : +49 (351) 210 1300
e-mail : be**@cenix-bioscience.com
Cenix Website : http://www.cenix-bioscience.com

Jul 18 '05 #5

Peter Otten

Nickolay Kolev wrote:

I am currently writing some simple functions in the process of learning
Python. I have a task where the program has to read in a text file and
display some statistics about the tokens in that file.

The text I have been feeding it is Dickens' David Copperfield.

It is really simple - it reads the file in memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of time it appears in the text.

The code (~30 lines and easy to read :-) can be found at
http://www.uni-bonn.de/~nmkolev/python/textStats.py

I am now looking for a way to make the whole thing run faster. I have
already made many changes since the initial version, realising many
mistakes. As I do not think of anything else, I thought I would ask the
more knowledgeable.

I find the two loops through the initial list a bit troubling. Could
this be avoided?

Any other remarks and suggestions will also be greatly appreciated.

I'd probably do this with regular expressions, either with r.split() where r
matches the spaces between the words or r.finditer() where r matches the
words.

But since you were asking for speed, another approach might be faster. First
I normalize all non-word characters to space and all alpha-chars to
lowercase. Note that this may yield different word frequencies as e.g.
"two.words" will be 2 words with my and one word with your approach.

from __future__ import division
import string, sys

def main(filename):
# caveat: assumes one byte per character
repl = string.whitespace + string.punctuation
tr = string.maketrans(repl, len(repl)*" ").lower()
assert len(tr) == 256

words = file(filename).read().translate(tr).split()
histogram = {}
wordCount = len(words)
for word in words:
histogram[word] = histogram.get(word, 0) + 1

wordCountList = [(c, w) for w, c in histogram.items()]
wordCountList.sort()
wordCountList.reverse()

print
for freq, word in wordCountList[:10]:
print '%15r %5d %5.2f%%' % (word, freq, freq / wordCount * 100)
print

if __name__ == "__main__":
main(sys.argv[1])

Other remarks:
Speed is not that important most of the time.
Psyco offers speed for free.
You might consider switching to 4-space indent.

Peter

Jul 18 '05 #6

Terry Reedy

"Neil Benn" <be**@cenix-bioscience.com> wrote in message
news:40**************@cenix-bioscience.com...

In other languages I've used (mainly java although some C, C# VB
<wash out your mouth>), the way I woud look at speeding this up is to
avoid loading all the words into memory in one go and then working upon
them. I'd create one stream which reads through the file, then passes
onto a listener each word it finds from the lexing (making the input
into tokens) and then another stream listening to this which will then
sort out the detail from these tokens (parsing), finally an output
stream which put this data wherever it needs to be (DB, screen, file,
etc). This means that the program would scale better (if you pass the
European voting register through your system it would take exponentially
much longer as you must scan the information twice).
You are talking about chaining iterators, which the generators and the new
iterator protocol make easier than before -- intentionally. Something like
following (untested).

filename = 'whatever'

def wordify(source):
for line in source:
for word in line.split():
yield word.strip()

def tabulate(words):
counts = {}
for word in words:
counts[word] = counts.get[word,0]
for wordcount in count.iteritems():
yield wordcount

def disposer(wordcounts):
for wordcount in wordcounts:
print wordcount

disposer(tabulate(wordify(filename)))

However as more experienced python programmers have not suggested
this is this because there is :

a. Something I'm not getting about python text handling
Nothing obvious.
b. Not easy/possible in python

Wrong (see above)

c. The OPs question (speedup) was answered a half hour after posting by an
experienced P. progammer (use dict.get) -- which answer makes the
processing one-pass, which in turn makes chaining possible.

d. It has been less than 5 hours since OP posted.

e. The OP did not ask how to restructure program to make it more modular.
Thinking ahead to make code more reusable and scaleable is a second order
concern after learning basics like getting .get(). But since you brought
the subject up ...

Terry J. Reedy

Jul 18 '05 #7

Nickolay Kolev

Peter Otten wrote:

Thanks for taking the time to rewrite the whole thing. I will try to
work through it as I do not understand what you are doing most of the
time there.

Speed is not that important most of the time.
Agreed. I am not looking for a magical solution that will shave some
time off at the price of using things I do not understand and/or are not
readable.
You might consider switching to 4-space indent.

Not wanting to start a flame here (this is probably a much discussed
topic): why? I write in VIM and use tabs. The default representation
width of the tab char there is 8 spaces. If I do get many, many nested
blocks, I could just set it to a lower width, no?

Nicky

Jul 18 '05 #8

Benji York

Nickolay Kolev <nm*****@uni-bonn.de> wrote in message news:<c6**********@f1node01.rhrz.uni-bonn.de>...

I find the two loops through the initial list a bit troubling. Could
this be avoided?

How about instead of:

for x in strippedWords:
unique[x] = 0

for x in strippedWords:
unique[x] += 1

You do something like (untested):

for word in strippedWords:
unique[word] = unique.get(word, 0) + 1

--
Benji York
http://benjiyork.com

Jul 18 '05 #9

Scott David Daniels

Peter Otten wrote:

Nickolay Kolev wrote: Playing along, simply because it's fun.
def main(filename):
...
#> words = file(filename).read().translate(tr).split()
#> histogram = {}
#> wordCount = len(words)
#> for word in words:
#> histogram[word] = histogram.get(word, 0) + 1

Better not to do several huge string allocs above (I suspect).
This method lets you to work on files too large to read into memory:

wordCount = 0
histogram = {}

for line in file(filename):
words = line.translate(tr).split()
wordCount += len(words)
for word in words:
histogram[word] = histogram.get(word, 0) + 1
...

- Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #10

Peter Otten

Scott David Daniels wrote:

Peter Otten wrote:
Nickolay Kolev wrote:

Playing along, simply because it's fun.
def main(filename):
...

#> words = file(filename).read().translate(tr).split()
#> histogram = {}
#> wordCount = len(words)
#> for word in words:
#> histogram[word] = histogram.get(word, 0) + 1

Better not to do several huge string allocs above (I suspect).
This method lets you to work on files too large to read into memory:

wordCount = 0
histogram = {}

for line in file(filename):
words = line.translate(tr).split()
wordCount += len(words)
for word in words:
histogram[word] = histogram.get(word, 0) + 1
...

In theory you are right, in practice most text files are tiny compared to
the amount of RAM available on a fairly recent machine.

However, I readily admit that your variant plays nicely with *very* large
files as long as they have newlines, and, contrary to my expectation, is
even a bit faster with my adhoc testcase, a 670,000 byte 8,700 line HTML
file.

Peter

Jul 18 '05 #11

Dennis Lee Bieber

On Wed, 21 Apr 2004 18:18:16 +0200, Nickolay Kolev <nm*****@uni-bonn.de>
declaimed the following in comp.lang.python:

unique.get(x, 0)

This says *give me the value of unique[x] if it exists, else give me 0*.
Correct?
Unless I misread the help file... I do know there IS a method
that does this, as I've used it in the past. The other scheme would be a
try/except block...

try:
unique[x] = unique[x] + 1 #maybe use the newer += operator
except:
unique[x] = 1

At first, this will tend to cause the exception to trigger on
each word, but as you start getting repeated words, the first operation
will complete.
-- ================================================== ============ <
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
wu******@dm.net | Bestiaria Support Staff <
================================================== ============ <
Home Page: <http://www.dm.net/~wulfraed/> <
Overflow Page: <http://wlfraed.home.netcom.com/> <

Jul 18 '05 #12

Nickolay Kolev

Scott David Daniels wrote:

Better not to do several huge string allocs above (I suspect).
This method lets you to work on files too large to read into memory:

.... code ...

Now this is something I was thinking about when I started considering
options for this task.

From my understanding, for line in file('...'): is only useful if I
want to make checks on the lines and then do something with them:

for line in file('...'):
if line.startswith('XXX'):
myDesiredLines.append(line)

This avoids creating a huge list of all lines and then filtering it to
another list with only the desired ones.

In my case I want to get all the lines regardless of any condition. I
also do not even need them as a list of lines, a single huge string is
completely sufficient. It will be split on whitespace later anyway.

Both methods should produce the same amount of memory usage as all words
are stored in a list. Reading a file line by line should be slower, as
Python would have to check where the newline characters are. Please
comment and correct, I am making assumptions here. Clarification from
someone who know how these things are internally implemented would be nice.

Nicky

Jul 18 '05 #13

Terry Reedy

"Nickolay Kolev" <nm*****@uni-bonn.de> wrote in message
news:c6**********@f1node01.rhrz.uni-bonn.de...

From my understanding, for line in file('...'): is only useful if I
want to make checks on the lines and then do something with them:
It is also useful for limiting amount in RAM at one time. The main
downside would be if the units you want to analyse, in this case words, are
split across line endings. But split() on whitespace will also split the
units at line endings.
Both methods should produce the same amount of memory usage as all words
are stored in a list.
It seems to me that 300 meg file chunk + 400 meg list/dict is larger than
small file chunk + 400 meg list/dict, but maybe I misunderstand what you
meant.
Reading a file line by line should be slower, as
Python would have to check where the newline characters are.

'for line in file' has now been optimized to run quickly. Behind the
scenes, sensibly sized chunks (such as 64K) are read from disk into a
memory buffer. Lines are then doled out one at a time. This is done with
compiled C. So I suspect the slowdown compared to read-all and split is
minimal.

Terry J. Reedy

Jul 18 '05 #14

by: J. Campbell | last post by:

From reading this forum, it is my understanding that C++ doesn't require the compiler to keep code that does not manifest itself in any way to the user. For example, in the following: { for(int...

C / C++

How do I manually set statistics for a VARGRAPHIC column

by: Philip Sherman | last post by:

I'm trying to copy production statistics to a test database and can't set the column statistics for a VARGRAPHIC column to match what RUNSTATS generated on production. The reason code and some...

DB2 Database

Optimizing query that spends a lot of time in 'statistics'

by: thomas.j.johnson | last post by:

I've got a very simple query ("SELECT a,b,c FROM mytable WHERE d=foo AND e=bar"; d and e are part of the primary key) that I'm executing tens of thousands of times in a particular application. The...

MySQL Database

Optimizing function pointer usage

by: Jack | last post by:

I have a chunk of code that loads a few dozen function pointers into global variables. I'm concerned with unused memory consumption. What if the client only needs to use one or two functions? Then...

C / C++

Collecting different execution statistics of C++ programs

by: romixnews | last post by:

Hi, I'm facing the problem of analyzing a memory allocation dynamic and object creation dynamics of a very big C++ application with a goal of optimizing its performance and eventually also...

C / C++

Optimizing Inner Loop Copy

by: Mark E. Fenner | last post by:

Hello all, I have a code where my inner loop looks like: allNew = for params in cases: newObj = copy(initialObject) newObj.modify(params) allNew.append(newObj) return allNew

Python

Text retrieval systems - 4B: the Library

by: JosAH | last post by:

Greetings, welcome back; above we discussed the peripherals of the Library class: loading and saving such an instantiation of it, the BookMark interface and then some. This part of the article...

Java

Analysing text files with Perl

by: Davo1977 | last post by:

Analysing text files to obtain statistics on their content You are to write a Perl program that analyses text files to obtain statistics on their content. The program should operate as follows: ...

Perl

SQL 2005 Optimizer not optimizing

by: John Rivers | last post by:

Hello has anybody else noticed I have queries that SQL 2000 optimizes correctly and they run very fast Yet SQL 2005 keeps using a dumb query plan and queries run very slow The problem...

Microsoft SQL Server

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Optimizing a text statistics function

Similar topics