Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:
for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c
I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.
Are you comparing BMH implemented in Python with str.count() implemented
in C?
So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?
Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
1. A pure Python suggestion:
Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)
Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:
self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict_count[query_key] = big_string_count(query_key)
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).
3. If you want to roll your own, start with Gonzalo Navarro's
publications:
http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html
HTH,
John