473,854 Members | 1,576 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Help Optimizing Word Search

Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

I'm using a trie structure to load and query my dictionary, which
returns a 1 if the word is found, 4 if a partial word is found, and 3
if there is no possible word.

I guess I'm wondering if anyone could think of a faster way to deal
with the wildcards, perhaps removing the inner for-loop or else
something entirely different (iterative, don't use the trie?).

Any other suggestions would be welcome

findword recursion runs in 9.16300010681 secs
words.log is simply my list of words:
['hats', 'easts', 'baizes',...,'s ash']

--- Code Begin ---
PY> import trie
PY> import time
PY>
PY> print 'loading trie ...',
PY> mytrie = trie.Trie('dict s.txt')
PY> print 'done.'
PY>
PY> alpha = list('abcdefghi jgklmnopqrstuvw xyz')
PY>
PY> words = {}
PY> def findword(word, letters):
PY> # run out of letters, recursion stop point
PY> if letters == []:
PY> return
PY>
PY> if word != "":
PY> # Check if word is valid
PY> result = mytrie.query(wo rd)
PY>
PY> # Found a word, add it to the database
PY> if result == 1:
PY> words[word] = 1
PY>
PY> # No other word starts with my current word, recursion stop
point
PY> if result == 3:
PY> return
PY>
PY> # Run through every letter in our list
PY> for i,letter in enumerate(lette rs):
PY> # Remove current letter and recurse
PY> newletters = letters[:]
PY> newletters.pop( i)
PY>
PY> # if current letter is ? must recurse through all 26
letters
PY> if letter == '?':
PY> for c in alpha:
PY> # recurse word+wildcard letter
PY> findword(word+c ,newletters)
PY> else:
PY> # recurse word+letter
PY> findword(word+l etter,newletter s)
PY>
PY> letters = list('abae?s?')
PY> s = time.time()
PY> findword("",let ters)
PY> print time.time() - s
PY>
PY> output = open('words.log ','w')
PY> print >> output, words.keys()
PY> output.close()
PY>
--- Code End ---
Thanks
(Hopefully google doesn't eat my whitespace)

Jul 18 '05 #1
10 3899
Case Nelson wrote:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly. So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca, cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa, ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz


This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway:

You appear to want to find all words that have one or more letters in
common with your query or candidate string.

Aside: Have you been following the long thread started by the poster
who appeared to want to store all possible strings that were _not_
words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit mask approach. You attach a bit mask to
each word; simple data structure -- a list of 2-tuples, or two parallel
lists.

!def mask(word):
! m = 0
! for letter in word:
! m |= 1 << (ord(letter) - ord('a'))
! return m

Searching without wildcards:

!def nowc_search(can didate, mylistof2tuples ):
! candmask = mask(candidate) # treating candidate as str, not list
! for word, mask in mylistof2tuples :
! if mask & candmask:
! # one or more letters in common
! yield word

Note: this treats "mississipp i" and "misp" the same. If "aa" is in your
dictionary, what queries would retrieve it? Depending on your exact
requirements, this technique may suit you, or you may want to use it as
a fast(?) filter, with the matches it throws up needing further
checking. You may need a "count number of bits that are set in an int"
function.

Ref: Fred J. Damerau, "A technique for computer detection and
correction of spelling errors", CACM vol 7 number 3, March 1961.

Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)

An alternative for you to think about: instead of a bitmask, store the
letter-sorted transformation of the words: cat->act, act->act,
dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value =
list of all words that have that key.

A further suggestion which should always be considered when setting up
a search where the timing is worse than average O(1): have a separate
dictionary for each different wordlength, or some other
impossible-length-avoidance filter; that way, with minimum preliminary
calculation you can avoid considering words that are so long or so
short that they cannot possibly be matches. For example, with
approximate matching based on edit distance, if you are searching for a
10-letter word allowing for 2 errors, you can avoid doing the
complicated comparison on words shorter than 8 or longer than 12.
HTH,
John

Jul 18 '05 #2
"Case Nelson" <ca*********@gm ail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.


For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.
Jul 18 '05 #3

Paul Rubin wrote:
"Case Nelson" <ca*********@gm ail.com> writes:
Basically, the program needs to take in a random list of no more than 10 letters, and find all possible mutations that match a word in my dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time
significantly.
For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.


Paul, given the OP appears to want something like words that match any
(per)mutation of any substring of his query string -- and that's before
factoring in wildcards -- I'd like to see an example of a regexp that
could handle that.
Cheers,
John

Jul 18 '05 #4
To search for a word which is a jumble of a given set of characters in a
(sizable) lexicon, see this posting:

http://blog.vrplumber.com/427

your alterations would be to check for length == to length -
number-of-wildcards (with the wildcards removed from the translation
table, of course) and then some tweaks to the "expensive" loop to allow
for up to wildcard-count ValueErrors. There's some (informal) analysis
in the comments of that post regarding why its a fairly good mechanism
for searching large sets of words.

HTH,
Mike

Case Nelson wrote:
....
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

....

_______________ _______________ _______________ ___
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Jul 18 '05 #5
First of all thank you very much for the reply. I hope I'm not too
verbose here, just hope that if someone else runs into a similar
problem they can find an answer here.
This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway
Heh, it's been awhile since I took data structures 101 but I asked here
after reading about python generators, and although I can't quite get
my head around them, I have a sense that they could help me check all
mutations (is this the right word?) of the candidate strings.

Thanks for the ideas on finding matches, I will have to try the bit
filter out to see if its faster. The trie works similar to this in that
it will return a 3 if the substring is not in the dictionary so that
you don't have to check any superset of the mutation. Although to
clarify, the reason I posted was mostly for help on my weak wildcard
algorithm.
If "aa" is in your dictionary, what queries would retrieve it?
Any query that contains either "a" and "a" or "a" and "?" or "?" and
"?". So some examples would be:
"aa", "a?", "aba", "baa", "bbbaba", "b?b?", "bab?", "?abbbbb" etc...
Searching with wild cards: your example of query == "??" seems to yield all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)


My apologies, in my haste my example of what query == "??" yields was
unclear. Every list of candidate letters should yield every unique (if
possible) mutation, of length 1 to len(candidate).

Heh, exactly what each of those strings would yield is pretty much the
root of my problem:

?? yields (the order of the results dont matter):
a, b, c, d, e, f, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb,
bc, ..., ya, yb, yc, ..., yx, yy, yz, za, zb, zc, ..., zx, zy, zz

a? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, ca, da, ...,
xa, ya, za

?a yields same as a?

ab? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ...,
bx, by, bz, ca, cb, da, db, ea, eb, ..., xa, xb, ya, yb, za, zb, aba,
abb, abc, abd, ..., abx, aby, abz, baa, bab, bac, ..., bax, bay, baz,
aab, acb, adb, ..., axb, ayb, azb, bba, bca, bda, ... bxa, bya, bza,
cba, dba, eba, xba, yba, zba

HTH, basically every possible string out of the given pool of up to 10
letters.

Ok writing these out just gave me an idea on how I could try and write
this another way.

Ok done. Wow, this program actually ran much much slower than my
original 135 seconds vs 9 seconds!! I'm not sure if I made a huge
mistake or if it's because in the original I can stop searching
superset-candidates if the candidate is not a substring of a dictionary
word (in the same way the mask would reject impossible mutations).

This produces the correct mutations however (although not unique as I
wrote out above):

-- begin code --
!def getmutations(po ol):
!
! for i, letter in enumerate(pool) :
! pool2 = pool[:]
! pool2.pop(i)
!
! for mutation in getmutations(po ol2):
! if letter != '?':
! yield letter + mutation
! else:
! for c in alpha:
! yield c + mutation
!
! yield ""
!letters = list('abae?s?')
!s = time.time()
!output = open('words.log ','w')
!for candidate in getmutations(le tters):
! # Check if word is valid
! result = mytrie.query(ca ndidate)
!
! # Found a word, add it to the database
! if result == 1:
! words[candidate] = 1
!
!print time.time() - s
--- end code ---

Jul 18 '05 #6
Mike,

Thanks, this worked great, and was 4x faster than my method!

Thanks everyone for replying!
The changes I made were:

!rest = ''.join([chr(i) for i in range(256) if chr(i).upper() not in
WORD])
!# a wildcard in the word means that we check all letters
!if '?' in WORD:
! rest = ''
!translator = string.maketran s('','')

and

! try:
! pool.remove( char )
! except ValueError:
! # nested try to remove possible wildcard
! try:
! pool.remove( '?' )
! except ValueError:
! return False

Jul 18 '05 #7
>>>>> "Case" == Case Nelson <ca*********@gm ail.com> writes:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with. Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz


You're trying to cheat at scrabble aren't you ;-). I once wrote a solution
using regular expressions, which I brushed up and posted here. Can you
check how it compares with your version?

Call as "wordsearch abcd". You can also use wildcards.

=====
..#!/usr/bin/python
..
..import sys
..import os
..import re
..
..wordlist = "/usr/share/dict/words"
..
..def get_freq(str):
.. """Create a hash of number of times a character occurs in string"""
.. freq = { }
.. for c in str:
.. freq[c] = freq.get(c, 0) + 1
.. return freq
..
..
..def main(argv = sys.argv):
.. chars = argv[1]
.. if argv[2]:
.. wordlist = argv[2]
.. wild = chars.count("?" )
.. if wild:
.. chars = chars.replace(" ?", "") # remove the wild cards
.. if chars:
.. charpat = "[" + chars + "]*"
.. else:
.. charpat = ""
..
.. # create a regexp suitable for matching
.. pattern = charpat
.. for w in xrange(wild):
.. pattern += ".?" + charpat
.. pattern = "^" + pattern + "$"
.. print >> sys.stderr, pattern
..
.. # char count for our list of chars
.. charfreq = get_freq(chars)
..
.. pattern_re = re.compile(patt ern, re.IGNORECASE)
.. # for line in os.popen("zegre p -i '%s' %s" % (pattern, wordlist)):
.. for line in open(wordlist):
.. line = line.rstrip()
.. if not pattern_re.sear ch(line):
.. continue
.. if not line or len(line) > len(chars) + wild:
.. continue
.. line = line.lower()
.. linefreq = get_freq(line)
..
.. # check if chars occur fewer times than in the given string
.. extra = 0
.. for c in linefreq.keys() :
.. if linefreq[c] > charfreq.get(c, 0):
.. extra += linefreq[c] - charfreq.get(c, 0)
.. if extra > wild:
.. break
.. else:
.. print line
..
..if __name__=='__ma in__':
.. main()

=====
Jul 18 '05 #8
Win
Hi Case.

Just in case you're really, truly looking for a fast Scrabble
word-search algorithm, the classic AI/CS article here is:

Andrew W. Appel and Guy J. Jacobson, The world's fastest Scrabble
program, Communications of the ACM, 31(5), pp 572--578, May 1988.

You can find a copy of this article at, for example:
http://counties.csc.calpoly.edu/~tea...le_program.pdf

This algorithm uses a "dawg" (Directed Acyclic Word Graph). You can
find a well-documented example of this data structure implemented in
Python in the the WordUtils package:
http://cedar-solutions.com/software/wordutils/

Regards,

Win

Jul 18 '05 #9
Paul Rubin wrote:
"Case Nelson" <ca*********@gm ail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.

For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.


If these queries happen often, build a dictionary of sorted letters to
lists-of-words. The setup will be slow, but use of such a secondary
dictionary should be quite fast, with relatively easy dealing with ?s
by iterations.

--Scott David Daniels
Sc***********@A cm.Org
Jul 18 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
3543
by: ArShAm | last post by:
Hi there Please help me to optimize this code for speed I added /O2 to compiler settings I added /Oe to compiler settings for accepting register type request , but it seems that is not allowed and if I remove register type for "l" , time of generating codes doesn't change the original code makes some files , but I removed that section to make it simple for you to read please help me to optimize it for faster running
14
4645
by: vic | last post by:
My manager wants me to develop a search program, that would work like they have it at edorado.com. She made up her requirements after having compared how search works at different websites, like eBay, Yahoo and others. This is what she wants my program to be able to do: (try this test at different websites just for fun). At eBay: - enter the word 'television' in a search field à you will get 2155 items.
4
9024
by: Tarique Jawed | last post by:
Alright I needed some help regarding a removal of a binary search tree. Yes its for a class, and yes I have tried working on it on my own, so no patronizing please. I have most of the code working, even the removal, I just don't know how to keep track of the parent, so that I can set its child to the child of the node to be removed. IE - if I had C / \ B D
2
3687
by: Andrew Cranwell | last post by:
Hi, Please redirect me if I am in the wrong group for this question... this is my first post to this group, so I hope it is clear! I am working on an automation project in VB.NET and have to interact with Word 2000 SP-3 to find and replace text within a word document. Unfortunately, development (not created by me!) occured with Word 2000 SR-1 and on ugrading the references and version to SP-3, some of the code inexplicably stopped...
1
3728
by: Rahul | last post by:
Hi Everybody I have some problem in my script. please help me. This is script file. I have one *.inq file. I want run this script in XML files. But this script errors shows . If u want i am attach this script files and inq files. I cant understand this error. Please suggest me. You can talk with my yahoo id b_sahoo1@yahoo.com. Now i am online. Plz....Plz..Plz...
24
3177
by: Richard G. Riley | last post by:
Without resorting to asm chunks I'm working on a few small routines which manipulate bitmasks. I'm looking for any guidance on writing C in a manner which tilts the compilers hand in, if possible, a compiler/underlying processor independant way : althought to be fair I cant see this stuff on anything other than x86, but who knows. I found some ok info here: http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm...
66
5429
by: genestarwing | last post by:
QUESTION: Write a program that opens and read a text file and records how many times each word occurs in the file. Use a binary search tree modified to store both a word and the number of times it occurs. After the program has read the file, it should offer a menu with three choices. the first is to list all the words along with the number of occurences. The second is to let you enter a word, with the program reporting how many times the...
9
1694
by: Igal | last post by:
hay. i have read some topics about this. but there's something i don't quite understand, if anyone can help. i have text, in it i want to highlight the words user used to find this result. i used the normal replace, it worked fine except the case thing... SearchForSp = Split(SearchFor, " ") For i = 0 to Ubound(SearchForSp) EmpJobsText = Replace(EmpJobsText, SearchForSp(i), "<strong><u><font
0
3879
by: alivip | last post by:
Is python provide search in parent folder contain sub folders and files for example folder name is cars and sub file is Toyota,Honda and BMW and Toyota contain file name camry and file name corola, file name honda contain folder accord and BMW contain file name X5 Is there way to enter name of parent folder(cars) and search in all sub folder(Toyota,Honda and BMW) and files ? how can I intgreat cod to be user interface (buttun ,text box...
0
9751
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11024
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10678
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
10367
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9512
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7914
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
7079
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5740
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
3
3185
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.