473,594 Members | 2,692 Online

# Counting occurences of words in a list of strings

Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
Jul 19 '05 #1
4 6836
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

Find a character that doesn't exist in any of self.dict.keys( ). '\x00'
will probably do.

whole = '\x00'.join(tc. strings)
for key in self.dict.keys( ):
self.dict[key]['count'] = whole.count(key )

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 19 '05 #2
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.
Are you comparing BMH implemented in Python with str.count() implemented
in C?

So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.

1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in

SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict[query_key]['count'] += big_string_coun t(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict_count[query_key] = big_string_coun t(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_positi on + 1).

publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John
Jul 19 '05 #3
John Machin wrote:
Are you comparing BMH implemented in Python with str.count() implemented
in C?
Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().
1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in

SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict[query_key]['count'] += big_string_coun t(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:
There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_positi on + 1).
Nifty! Thanks muchly.
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
Jul 19 '05 #4
Travers Naran wrote:
John Machin wrote:
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.

Do you mean NIH syndrome? Sorry, I should have been clearer, like "if
Aho-Corasick algorithm is about 30 years old. Navarro is part of, and
summarises the rest of, the state of the art.

Cheers,
John
Jul 19 '05 #5

This thread has been closed and replies have been disabled. Please start a new discussion.