473,325 Members | 2,608 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,325 software developers and data experts.

Counting occurences of words in a list of strings

Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
Jul 19 '05 #1
4 6815
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.


Find a character that doesn't exist in any of self.dict.keys(). '\x00'
will probably do.

whole = '\x00'.join(tc.strings)
for key in self.dict.keys():
self.dict[key]['count'] = whole.count(key)

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 19 '05 #2
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:

for j in self.dict.keys():
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.
Are you comparing BMH implemented in Python with str.count() implemented
in C?

So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.


1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict_count[query_key] = big_string_count(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).

3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John
Jul 19 '05 #3
John Machin wrote:
Are you comparing BMH implemented in Python with str.count() implemented
in C?
Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().
1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:
There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).
Nifty! Thanks muchly.
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html


I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
Jul 19 '05 #4
Travers Naran wrote:
John Machin wrote:
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.


Do you mean NIH syndrome? Sorry, I should have been clearer, like "if
you want faster, you will have to roll your own; start with ...". The
Aho-Corasick algorithm is about 30 years old. Navarro is part of, and
summarises the rest of, the state of the art.

Cheers,
John
Jul 19 '05 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
by: NiQ | last post by:
Hello every one, this is about an array of string and may be have done several time before but i need it and wasnt able to find it so here is the problem i have an array of strings with contains...
5
by: Cam Bevis | last post by:
I'm having trouble getting my head around this, and no one in the groups has posted exactly the problem. The table below tracks site traffic across a network. There is 1 row per pageview and...
4
by: Victor Engmark | last post by:
When looking for a method to fetch unique elements and counting the number of occurences of each of them, I found quite a lot of gross examples of complex XSL. But after realizing the subtle...
4
by: Ralph Noble | last post by:
Does anyone know of a string function in Access that will allow me to count the number of instances one string occurs within another? Or if there is some sort of word count function? If there is,...
4
by: sun6 | last post by:
this is a program counting words from "text_in.txt" file and writing them in "text_out.txt". it uses binary tree search, but there is an error when i use insert () thanks for any help ...
4
by: Jason Gleason | last post by:
What's the most efficient way to get the number of occurences of a certain string in another string..for instance i'm using the following code right now... private int CharacterCounter(String...
7
by: Timo Haberkern | last post by:
Hi there, i have some troubles with my TSearch2 Installation. I have done this installation as described in http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words...
3
by: Nhd | last post by:
I have a question which involves reading from cin and counting the number of words read until the end of file(eof). The question is as follows: Words are delimited by white spaces (blanks,...
20
by: dmurray14 | last post by:
Hey guys, I'm a C++ newbie here - I've messed with VB, but I mostly stick to web languages, so I find C++ to be very confusing at times. Basically, I am trying to import a text file, but I want...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.