473,594 Members | 2,692 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Counting occurences of words in a list of strings

Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.
Jul 19 '05 #1
4 6836
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the substrings
stored as keys). I have a list of strings. I want to find out, for each
word in the dictionary, how many times the substring occurs non-overlapping
occurrences there are in the list of strings. This is the best I could
come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up before
the inner loop and only testing the pattern in the inner loop, it was still
slow.

So, am I doing this the optimal way or am I overlooking something _really_
simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.


Find a character that doesn't exist in any of self.dict.keys( ). '\x00'
will probably do.

whole = '\x00'.join(tc. strings)
for key in self.dict.keys( ):
self.dict[key]['count'] = whole.count(key )

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 19 '05 #2
Travers Naran wrote:
Here's the basic idea. I have a dictionary of substrings (the
substrings stored as keys). I have a list of strings. I want to find
out, for each word in the dictionary, how many times the substring
occurs non-overlapping occurrences there are in the list of strings.
This is the best I could come up with:

for j in self.dict.keys( ):
c = 0
for i in tc.strings:
c += i.count(j)
self.dict[j]['count'] += c

I tried Boyer-Moore-Horspool as someone else suggested in the group, but
it was even slower (probably because of the time spent setting up the
pattern). Even when I split the algorithm in two, doing the set-up
before the inner loop and only testing the pattern in the inner loop, it
was still slow.
Are you comparing BMH implemented in Python with str.count() implemented
in C?

So, am I doing this the optimal way or am I overlooking something
_really_ simple to do?

Statistics on the data:
The dictionary has 3000+ entries in it.
There are several thousand strings.
The strings average 10-26 characters in length.


1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict[query_key]['count'] += big_string_coun t(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict_count[query_key] = big_string_coun t(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_positi on + 1).

3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John
Jul 19 '05 #3
John Machin wrote:
Are you comparing BMH implemented in Python with str.count() implemented
in C?
Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().
1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:

SEP = '\n' # for example
big_string_coun t = SEP.join(tc.str ings).count
for query_key in self.dict.keys( ):
self.dict[query_key]['count'] += big_string_coun t(query_key)

Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:
There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_positi on + 1).
Nifty! Thanks muchly.
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html


I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
Jul 19 '05 #4
Travers Naran wrote:
John Machin wrote:
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.


Do you mean NIH syndrome? Sorry, I should have been clearer, like "if
you want faster, you will have to roll your own; start with ...". The
Aho-Corasick algorithm is about 30 years old. Navarro is part of, and
summarises the rest of, the state of the art.

Cheers,
John
Jul 19 '05 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
2992
by: NiQ | last post by:
Hello every one, this is about an array of string and may be have done several time before but i need it and wasnt able to find it so here is the problem i have an array of strings with contains words and i want a distinct list of array with the count of each number repeat i.e how many time it exist in the array and the final result should be sorted according to the number of repeat from hight to low . for example
5
2689
by: Cam Bevis | last post by:
I'm having trouble getting my head around this, and no one in the groups has posted exactly the problem. The table below tracks site traffic across a network. There is 1 row per pageview and UUID is that user's unique cookie. CREATE TABLE . ( NOT NULL , NULL , NULL ,
4
6565
by: Victor Engmark | last post by:
When looking for a method to fetch unique elements and counting the number of occurences of each of them, I found quite a lot of gross examples of complex XSL. But after realizing the subtle difference between "." and "current()", I found a neat way of doing the same without keys or generate-id(): <xsl:template match="/"> <!-- Selects all "new" elements --> <xsl:for-each select="//Name"> <!-- Display the element -->
4
3485
by: Ralph Noble | last post by:
Does anyone know of a string function in Access that will allow me to count the number of instances one string occurs within another? Or if there is some sort of word count function? If there is, I'm not aware of it. I've tried the following unsuccessfully ... SELECT *, (Len()-Len(Replace(,' what_im_looking_to_count','')))/(Len(' what_im_looking_to_count')) AS KeywordFoundCountFROM tblYourTable; That'll find how many times the phrase...
4
4064
by: sun6 | last post by:
this is a program counting words from "text_in.txt" file and writing them in "text_out.txt". it uses binary tree search, but there is an error when i use insert () thanks for any help #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h>
4
13634
by: Jason Gleason | last post by:
What's the most efficient way to get the number of occurences of a certain string in another string..for instance i'm using the following code right now... private int CharacterCounter(String text,String Character) { int count = 0;
7
5444
by: Timo Haberkern | last post by:
Hi there, i have some troubles with my TSearch2 Installation. I have done this installation as described in http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_compound_words <http://www.sai.msu.su/%7Emegera/oddmuse/index.cgi/Tsearch_V2_compound_words> I used the german myspell dictionary from http://lingucomponent.openoffice.org/spell_dic.html and converted it with my2ispell
3
6346
by: Nhd | last post by:
I have a question which involves reading from cin and counting the number of words read until the end of file(eof). The question is as follows: Words are delimited by white spaces (blanks, tabs, linefeeds, returns). It will print 3 pieces of information before exit: the first word, the last word, and the number of words read, with one blank in between fields. Input lines of words Output
20
5114
by: dmurray14 | last post by:
Hey guys, I'm a C++ newbie here - I've messed with VB, but I mostly stick to web languages, so I find C++ to be very confusing at times. Basically, I am trying to import a text file, but I want to do it word by word. I am confused as to how to do this. Typically, I would think it would make sense to try and input the words into strings, but for this application I need to use character arrays and pointers. So what's the best way to go...
0
7947
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
7880
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
1
8010
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8242
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
6665
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5413
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
3868
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
3903
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1486
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.