Bytes IT Community

Suggest more finesse, please. I/O and sequences.

Would you suggest any improvements to the following code?
I want to make my implementation as simple, as Python-native, and as clean as
possible.

I've written a simple script that reads an input text file and produces a
ranking of words by number of appearances.

Code:
---------------------------------------------------------------------------
import sys

def moreCommonWord( x, y ):
    if x[1] != y[1]:
        return cmp( x[1], y[1] ) * -1
    return cmp( x[0], y[0] )

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
    if wordsDic.has_key( word ):
        wordsDic[word] = wordsDic[word] + 1
    else:
        wordsDic[word] = 1
inFile.close()

wordsLst = wordsDic.items()
wordsLst.sort( moreCommonWord )

outFile = open( sys.argv[2], 'w')
for pair in wordsLst:
    outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n" )
outFile.close()
---------------------------------------------------------------------------
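For readers on modern Python, the whole exercise collapses into a few lines with `collections.Counter`; this is a present-day sketch, not something available in the 2005-era Python this thread discusses:

```python
from collections import Counter

def rank_words(text):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(text.split())
    # Sort by descending count, then ascending word, as in the original.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# rank_words("b a b c a b") gives [('b', 3), ('a', 2), ('c', 1)]
```

`Counter` also has a `most_common()` method, but it does not break count ties alphabetically, so the explicit sort key is kept here.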

In particular, I don't like reading the whole file just to split it.
It is easy to read line by line - can I read word by word with the same ease?

PS I've been learning Python since this morning, so be understanding :>

--
Greets,
Piotrek
Jul 18 '05 #1
8 Replies


Qertoip wrote:
import sys

def moreCommonWord( x, y ):
    if x[1] != y[1]:
        return cmp( x[1], y[1] ) * -1
    return cmp( x[0], y[0] )

If you want to keep this, use:

def moreCommonWord(x, y):
    if x[1] != y[1]:
        return cmp(y[1], x[1])
    return cmp(x[0], y[0])

....
wordsDic = {}

I don't like type-based names (Charles Simonyi never convinced me), so:

corpus = {}

....
for word in inFile.read().split():
    if wordsDic.has_key( word ):
        wordsDic[word] = wordsDic[word] + 1
    else:
        wordsDic[word] = 1
inFile.close()

How about:

for line in inFile:
    for word in line.split():
        try:
            corpus[word] += 1
        except KeyError:
            corpus[word] = 1

....
wordsLst = wordsDic.items()
wordsLst.sort( moreCommonWord )

OK, here I'm going to get version specific.

For Python 2.4 and later:
    words = sorted((-freq, word) for word, freq
                   in corpus.iteritems())
For at least Python 2.2:
    words = [(-freq, word) for word, freq in corpus.iteritems()]
    words.sort()
For before Python 2.2:
    words = corpus.items()
    words.sort(moreCommonWord)
for pair in wordsLst:
    outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n" )
outFile.close()


Before python 2.2 (because we use different data for words):
    for word, frequency in words:
        print >>outFile, '%7d : %s' % (frequency, word)

After python 2.2:
    for negfrequency, word in words:
        print >>outFile, '%7d : %s' % (-negfrequency, word)
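The negated-frequency tuple sort carries over to Python 3 unchanged, except that `iteritems()` becomes `items()`; a small sketch with a made-up corpus:

```python
# A toy corpus; in the real script this dict is built from the input file.
corpus = {'the': 5, 'a': 5, 'zebra': 1}

# iteritems() became items() in Python 3; the negation trick is the same.
words = sorted((-freq, word) for word, freq in corpus.items())

# Ties on frequency fall back to alphabetical order on the word itself.
for negfreq, word in words:
    print('%7d : %s' % (-negfreq, word))
```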
So, with all my prejudices in place and python 2.4 on my box, I'd
lift a few things to functions:

def refcount(corpus, infile):
    '''Update word counters in corpus from words in infile'''
    for line in infile:
        for word in line.split():
            try:
                corpus[word] += 1
            except KeyError:
                corpus[word] = 1

def main(sources, output=None):
    '''Count words in sources and report frequencies to output'''
    corpus = {}
    for source in sources:
        f = open(source)
        refcount(corpus, f)
        f.close()
    for negfrequency, word in sorted((-frequency, word) for
                                     word, frequency in corpus.iteritems()):
        print >>output, '%7d : %s' % (-negfrequency, word)

if __name__ == '__main__':
    import sys

    if len(sys.argv) < 3:
        main(sys.argv[1 :])
    else:
        output = open(sys.argv[-1], 'w')
        try:
            main(sys.argv[1 : -1], output)
        finally:
            output.close()
--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #2

On Fri, 25 Mar 2005 12:51:59 -0800, Scott David Daniels wrote:

Thanks for your reply! It was really enlightening.

How about:

for line in inFile:
    for word in line.split():
        try:
            corpus[word] += 1
        except KeyError:
            corpus[word] = 1

The above is (probably) not efficient when the exception is thrown, which is
most of the time (for any new word). However, I've just read about the
following:
    corpus[word] = corpus.setdefault( word, 0 ) + 1
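A third idiom, `collections.defaultdict`, arrived in Python 2.5 (just after this thread) and avoids both the exception and the per-word `setdefault` call; a sketch comparing the two:

```python
from collections import defaultdict

text = "to be or not to be"

# setdefault, as suggested above:
c1 = {}
for w in text.split():
    c1[w] = c1.setdefault(w, 0) + 1

# defaultdict(int) supplies the missing 0 automatically on first access:
c2 = defaultdict(int)
for w in text.split():
    c2[w] += 1

# Both produce the same counts: {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```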

wordsLst = wordsDic.items()
wordsLst.sort( moreCommonWord )

OK, here I'm going to get version specific.
For Python 2.4 and later:
    words = sorted((-freq, word) for word, freq in corpus.iteritems())


This is my favorite! :) You managed to avoid moreCommonWord() through the
clever use of a generator expression and the sequence comparison rules.

After python 2.2:
    for negfrequency, word in words:
        print >>outFile, '%7d : %s' % (-negfrequency, word)
This is also cool, I didn't know about this kind of 'print' usage.
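(The `print >>file` redirection discussed here disappeared in Python 3, where the equivalent is the `file` keyword argument; a sketch outside this thread, using an in-memory buffer instead of a real file:)

```python
import io

buf = io.StringIO()
# Python 2 spelling: print >>buf, '%7d : %s' % (3, 'the')
print('%7d : %s' % (3, 'the'), file=buf)
# buf now holds '      3 : the\n' (count right-justified to 7 columns)
```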

So, with all my prejudices in place and python 2.4 on my box, I'd
lift a few things to functions:


While I like your functionality and reusability improvements, I will stick
to my as-simple-as-possible solution for the given requirements (which I
didn't mention, and which assume correct command line arguments, for example).

Therefore, the current code is:
-------------------------------------------------------------------------
import sys

corpus = {}
inFile = open( sys.argv[1] )
for line in inFile:
    for word in line.split():
        corpus[word] = corpus.setdefault( word, 0 ) + 1
inFile.close()

words = sorted( ( -freq, word ) for word, freq in corpus.iteritems() )

outFile = open( sys.argv[2], 'w')
for negFreq, word in words:
    print >>outFile, '%7d : %s' % ( -negFreq, word )
outFile.close()
-------------------------------------------------------------------------

Any ideas how to make it even better? :>
--
Regards,
Piotrek
Jul 18 '05 #3

You might take advantage of the .get method on
dictionaries to rewrite:

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
    if wordsDic.has_key( word ):
        wordsDic[word] = wordsDic[word] + 1
    else:
        wordsDic[word] = 1

as:

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
    wordsDic[word] = wordsDic.get(word, 0) + 1

and, taking advantage of tuple unpacking and % formatting, rewrite:

for pair in wordsLst:
    outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n")

as:

for word, count in wordsLst:
    outFile.write("%7s : %i\n" % (word, count))

I guess you assumed all your words were less than 7 characters long (which
I copied).
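For the record, the original's str(count).rjust(7) and a '%7d' conversion on the count produce byte-identical padding; a quick check with made-up values:

```python
count, word = 42, 'spam'

# Manual padding, as in the original script:
manual = str(count).rjust(7) + " : " + word
# % formatting, as suggested in this reply:
formatted = "%7d : %s" % (count, word)

# Both yield '     42 : spam' (count right-justified to 7 columns).
```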

But there are many other "good" ways I'm sure.

Larry Bates

Qertoip wrote:
[snip original post and code]

Jul 18 '05 #4

Qertoip wrote:
On Fri, 25 Mar 2005 12:51:59 -0800, Scott David Daniels wrote:
...
for word in line.split():
    try:
        corpus[word] += 1
    except KeyError:
        corpus[word] = 1

The above is (probably) not efficient when the exception is thrown, which is
most of the time (for any new word). However, I've just read about the
following:
    corpus[word] = corpus.setdefault( word, 0 ) + 1


That is better for things like:
    corpus.setdefault(word, []).append(...)

You might prefer:

    corpus[word] = corpus.get(word, 0) + 1

The trade-off depends on the size of your test material. You need
to time it with your mix of words. I was thinking of cranking
through a huge body of text (so words of frequency 1 are by far
the minority case). If you run through Shakespeare's first folio
and just do the counting part, the try-except and .get cases are
indistinguishable (2.0 sec each), and the .setdefault version
drags in at a slow 2.2 sec. Just going through Anna Karenina,
again 0.83, 0.83 and 0.91 sec. So the .setdefault form is 10% slower.
For great test cases (and for your own personal edification),
visit Project Gutenberg.

Beware when you do timing: whether the file is "warm" or not can
make a huge difference. Read through it once before timing either.
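Timings like these can be reproduced with the `timeit` module; a rough sketch using a tiny synthetic corpus (absolute numbers will of course differ from the first-folio figures above):

```python
import timeit

# A small synthetic corpus; real measurements need real text.
words = ('the quick brown fox jumps over the lazy dog ' * 50).split()

def count_get():
    c = {}
    for w in words:
        c[w] = c.get(w, 0) + 1
    return c

def count_try():
    c = {}
    for w in words:
        try:
            c[w] += 1
        except KeyError:
            c[w] = 1
    return c

# Same result either way; only the speed differs.
t_get = timeit.timeit(count_get, number=200)
t_try = timeit.timeit(count_try, number=200)
print('get: %.4fs  try/except: %.4fs' % (t_get, t_try))
```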
--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #5

On Fri, 25 Mar 2005 19:17:30 +0100, Qertoip wrote:
Would you suggest any improvements to the following code? I want to make my
implementation as simple, as Python-native, and as clean as possible.
I've written a simple script that reads an input text file and ranks words
by number of appearances.


A good friend of mine heard about my attempt to create a compact, simple
Python script and authored the following PHP script:

--------------------------------------------------------------------------
$data=join(' ', file($argv[1]));
foreach(explode(' ', $data) as $slowo)
    $stat[chop($slowo)]++;

array_multisort($stat, SORT_DESC, array_keys($stat));

foreach($stat as $sl=>$il)
    $odata.="$il : $sl\n";

file_put_contents($argv[2], $odata);
--------------------------------------------------------------------------

....which has the same functionality in fewer actual code lines [7].
I'm a little bit confused, since I considered Python more expressive than
PHP. All the more reason I'm interested in improving my implementation now :)
It currently looks like this [11 actual code lines]:

--------------------------------------------------------------------------
import sys

corpus = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
    corpus[word] = corpus.get( word, 0 ) + 1
inFile.close()

words = sorted( ( -freq, word ) for word, freq in corpus.iteritems() )

outFile = open( sys.argv[2], 'w')
for negFreq, word in words:
    outFile.write( '%7d : %s\n' % ( -negFreq, word ) )
outFile.close()
--------------------------------------------------------------------------

PS Thanks to Scott David Daniels and Larry Bates
--
Regards,
Piotrek
Jul 18 '05 #6

Qertoip wrote:
A good friend of mine heard about my attempt to create a compact, simple
Python script and authored the following PHP script:
[snip 7-line PHP script]
....which has the same functionality in fewer actual code lines [7].
I'm a little bit confused, since I considered Python more expressive than
PHP. All the more reason I'm interested in improving my implementation now :)
It currently looks like this [11 actual code lines]:

[snip 11-line Python]

If you want fewer lines, it's not too hard. Note that
if anyone has a problem with the style of this, I
point merely to the unreadable line noise that was
the PHP script as evidence that this is not about
style but conciseness:
--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
    corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------

I'm curious if either this or the PHP does what is really
wanted, however. The above doesn't split on "words", but
merely on whitespace, making the results fairly meaningless
if you are concerned about punctuation etc.
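If punctuation does matter, one common fix (not from this thread) is to extract alphabetic runs with a regular expression instead of splitting on whitespace:

```python
import re

line = "Hey,!1_go -- hey; go."
# findall with a letters-only pattern drops punctuation, digits
# and underscores that str.split() would leave attached to words.
words = re.findall(r"[a-zA-Z]+", line.lower())
# words is now ['hey', 'go', 'hey', 'go']
```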

-Peter
Jul 18 '05 #7

On Fri, 25 Mar 2005 21:09:41 -0500, Peter Hansen wrote:

Thanks for comments! :)
Qertoip wrote:
A good friend of mine heard about my attempt to create a compact, simple
Python script and authored the following PHP script:
[snip 7-line PHP script]
...which has the same functionality in fewer actual code lines [7].
I'm a little bit confused, since I considered Python more expressive than
PHP. All the more reason I'm interested in improving my implementation now :)
It currently looks like this [11 actual code lines]:

[snip 11-line Python]

--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
    corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------
Is the file automatically closed in both cases?

I'm curious if either this or the PHP does what is really
wanted, however. The above doesn't split on "words", but
merely on whitespace, making the results fairly meaningless
if you are concerned about punctuation etc.


You are perfectly right, but the requirements were intentionally
simplified, assuming error-free input and allowing 'words' like 'hey,!1_go'.

I aim to make my implementation as concise as possible while
*still being natural, readable and clear*.
--
Regards,
Piotrek
Jul 18 '05 #8

Qertoip wrote:
Peter Hansen wrote:
--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
    corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------


Is the file automatically closed in both cases?


Yes. Either as soon as the file object has no more
references to it (in CPython), or whenever the
application terminates or the garbage collector
runs and reclaims the objects (in Jython, IronPython,
and any other Pythons that don't use reference
counting).

Either way, for a small script, it's safe.
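Later Pythons (2.5 with a __future__ import, 2.6+ natively) make the closing deterministic on any implementation with a `with` block; a sketch using a temp path (the filename here is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'wordfreq_demo.txt')
with open(path, 'w') as outFile:
    outFile.write('%7d : %s\n' % (3, 'the'))
# The file is guaranteed closed here, reference counting or not.
os.remove(path)
```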

-Peter
Jul 18 '05 #9

This discussion thread is closed; replies have been disabled.