
Help Optimizing Word Search

Hi there. I've just been playing around with some Python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters and find all possible mutations that match a word in my
dictionary (80k words). However, a wildcard letter '?' is also an
acceptable character, which increases the worst-case time significantly.
So if the letters are ['a','b','c'], check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba, where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'], check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz.

I'm using a trie structure to load and query my dictionary, which
returns a 1 if the word is found, 4 if a partial word is found, and 3
if there is no possible word.
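
(My actual trie module isn't shown here, but for anyone who wants to run the
code below, a minimal dict-based stand-in with the same return codes would be
something like this sketch -- save it as trie.py and the import below will
pick it up:)

--- Code Begin ---
# Minimal stand-in for my trie module -- a sketch only, matching the return
# codes described above (1 = word, 4 = prefix of a word, 3 = dead end).
class Trie:
    WORD, PREFIX, DEAD_END = 1, 4, 3

    def __init__(self, filename):
        self.root = {}
        for line in open(filename):
            node = self.root
            for ch in line.strip().lower():
                node = node.setdefault(ch, {})
            node['$'] = True                    # end-of-word marker

    def query(self, s):
        node = self.root
        for ch in s:
            if ch not in node:
                return self.DEAD_END            # no word starts with s
            node = node[ch]
        if '$' in node:
            return self.WORD                    # s is a complete word
        return self.PREFIX                      # s is a prefix of some word
--- Code End ---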

I guess I'm wondering if anyone could think of a faster way to deal
with the wildcards, perhaps removing the inner for-loop or else
something entirely different (iterative, don't use the trie?).

Any other suggestions would be welcome.

findword recursion runs in 9.16300010681 secs
words.log is simply my list of words:
['hats', 'easts', 'baizes',...,'sash']

--- Code Begin ---
import trie
import time

print 'loading trie ...',
mytrie = trie.Trie('dicts.txt')
print 'done.'

alpha = list('abcdefghijklmnopqrstuvwxyz')

words = {}

def findword(word, letters):
    if word != "":
        # Check if word is valid
        result = mytrie.query(word)

        # Found a word, add it to the database
        if result == 1:
            words[word] = 1

        # No other word starts with my current word, recursion stop point
        if result == 3:
            return

    # Run out of letters, recursion stop point
    if letters == []:
        return

    # Run through every letter in our list
    for i, letter in enumerate(letters):
        # Remove current letter and recurse
        newletters = letters[:]
        newletters.pop(i)

        # If current letter is '?', recurse through all 26 letters
        if letter == '?':
            for c in alpha:
                # recurse word + wildcard letter
                findword(word + c, newletters)
        else:
            # recurse word + letter
            findword(word + letter, newletters)

letters = list('abae?s?')
s = time.time()
findword("", letters)
print time.time() - s

output = open('words.log', 'w')
print >> output, words.keys()
output.close()
--- Code End ---
Thanks
(Hopefully google doesn't eat my whitespace)

Jul 18 '05 #1
Case Nelson wrote:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with.

Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly. So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca, cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa, ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz


This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway:

You appear to want to find all words that have one or more letters in
common with your query or candidate string.

Aside: Have you been following the long thread started by the poster
who appeared to want to store all possible strings that were _not_
words in a given language but could be generated from its alphabet?

Here's a possibility: use a bit mask approach. You attach a bit mask to
each word; simple data structure -- a list of 2-tuples, or two parallel
lists.

def mask(word):
    m = 0
    for letter in word:
        m |= 1 << (ord(letter) - ord('a'))
    return m

Searching without wildcards:

def nowc_search(candidate, mylistof2tuples):
    candmask = mask(candidate)      # treating candidate as str, not list
    for word, wordmask in mylistof2tuples:
        if wordmask & candmask:
            # one or more letters in common
            yield word

Note: this treats "mississippi" and "misp" the same. If "aa" is in your
dictionary, what queries would retrieve it? Depending on your exact
requirements, this technique may suit you, or you may want to use it as
a fast(?) filter, with the matches it throws up needing further
checking. You may need a "count number of bits that are set in an int"
function.
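
A simple sketch of such a bit-count function:

def count_bits(n):
    # Clear the lowest set bit until nothing is left (Kernighan's trick).
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count

# e.g. the number of distinct letters a word shares with the candidate:
# shared = count_bits(mask(word) & mask(candidate))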

Ref: Fred J. Damerau, "A technique for computer detection and
correction of spelling errors", CACM vol 7, number 3, March 1964.

Searching with wild cards: your example of query == "??" seems to yield
all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)

An alternative for you to think about: instead of a bitmask, store the
letter-sorted transformation of the words: cat->act, act->act,
dog->dgo, god->dgo.

Alternative data structure: key = bitmask or sorted-letters, value =
list of all words that have that key.
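
A rough sketch of that keyed structure, using sorted letters (word-list
path assumed):

anagrams = {}
for line in open('dicts.txt'):              # assumed word-list file, one word per line
    word = line.strip().lower()
    key = ''.join(sorted(word))             # cat -> act, dog -> dgo, god -> dgo
    anagrams.setdefault(key, []).append(word)

# An exact-anagram lookup is then a single dict probe:
print(anagrams.get(''.join(sorted('dog')), []))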

A further suggestion which should always be considered when setting up
a search where the timing is worse than average O(1): have a separate
dictionary for each different wordlength, or some other
impossible-length-avoidance filter; that way, with minimum preliminary
calculation you can avoid considering words that are so long or so
short that they cannot possibly be matches. For example, with
approximate matching based on edit distance, if you are searching for a
10-letter word allowing for 2 errors, you can avoid doing the
complicated comparison on words shorter than 8 or longer than 12.
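
A sketch of that length-bucket filter (same assumed word-list file):

by_length = {}
for line in open('dicts.txt'):
    word = line.strip().lower()
    by_length.setdefault(len(word), []).append(word)

# For a 10-letter query allowing 2 errors, only lengths 8..12 need checking:
candidates = []
for n in range(8, 13):
    candidates.extend(by_length.get(n, []))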
HTH,
John

Jul 18 '05 #2
"Case Nelson" <ca*********@gmail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.


For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.
Jul 18 '05 #3

Paul Rubin wrote:
"Case Nelson" <ca*********@gmail.com> writes:
Basically, the program needs to take in a random list of no more than 10 letters, and find all possible mutations that match a word in my dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time
significantly.
For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.


Paul, given the OP appears to want something like words that match any
(per)mutation of any substring of his query string -- and that's before
factoring in wildcards -- I'd like to see an example of a regexp that
could handle that.
Cheers,
John

Jul 18 '05 #4
To search for a word which is a jumble of a given set of characters in a
(sizable) lexicon, see this posting:

http://blog.vrplumber.com/427

Your alterations would be to check for a translated length equal to the
word's length minus the number of wildcards (with the wildcards removed
from the translation table, of course), and then some tweaks to the
"expensive" loop to allow for up to wildcard-count ValueErrors. There's
some (informal) analysis in the comments of that post regarding why it's
a fairly good mechanism for searching large sets of words.
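
Roughly -- this is only a sketch of the idea, not the blog's actual code,
and it assumes lowercase input -- the per-word test with wildcards might
look like:

import string

def can_make(word, rack):
    # rack is e.g. 'abae?s?'; wildcards are counted separately
    wildcards = rack.count('?')
    pool = [c for c in rack if c != '?']

    # Cheap filter: delete every rack letter from the word; whatever is left
    # over must be covered by wildcards.
    translator = string.maketrans('', '')
    leftover = word.translate(translator, ''.join(pool))
    if len(leftover) > wildcards:
        return False

    # "Expensive" loop: verify letter multiplicities, allowing up to
    # wildcard-count misses (each miss shows up as a ValueError).
    misses = 0
    remaining = pool[:]
    for char in word:
        try:
            remaining.remove(char)
        except ValueError:
            misses += 1
            if misses > wildcards:
                return False
    return True

Scanning the whole word list is then just
[w for w in wordlist if can_make(w, 'abae?s?')].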

HTH,
Mike

Case Nelson wrote:
....
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz

....

________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com

Jul 18 '05 #5
First of all thank you very much for the reply. I hope I'm not too
verbose here, just hope that if someone else runs into a similar
problem they can find an answer here.
This appears to be a Computer Science 101 Data Structures and
Algorithms question, not a Python question, but here's an answer
anyway
Heh, it's been a while since I took data structures 101, but I asked here
after reading about Python generators, and although I can't quite get
my head around them, I have a sense that they could help me check all
mutations (is this the right word?) of the candidate strings.

Thanks for the ideas on finding matches; I will have to try the bit
filter out to see if it's faster. The trie works similarly to this in that
it will return a 3 if the prefix is not in the dictionary, so that
you don't have to check any superset of the mutation. Although to
clarify, the reason I posted was mostly for help with my weak wildcard
algorithm.
If "aa" is in your dictionary, what queries would retrieve it?
Any query that contains either "a" and "a" or "a" and "?" or "?" and
"?". So some examples would be:
"aa", "a?", "aba", "baa", "bbbaba", "b?b?", "bab?", "?abbbbb" etc...
Searching with wild cards: your example of query == "??" seems to yield all two-letter words. I'd like to see what you expect for "a?", "?a",
"ab?", and "aa?" before suggesting how to tackle wild cards.
Reverse-engineering requirements out of other folks' code is not
something I do for fun :-)


My apologies, in my haste my example of what query == "??" yields was
unclear. Every list of candidate letters should yield every unique (if
possible) mutation, of length 1 to len(candidate).

Heh, exactly what each of those strings would yield is pretty much the
root of my problem:

?? yields (the order of the results dont matter):
a, b, c, d, e, f, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb,
bc, ..., ya, yb, yc, ..., yx, yy, yz, za, zb, zc, ..., zx, zy, zz

a? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, ca, da, ...,
xa, ya, za

?a yields same as a?

ab? yields
a, b, c, d, ..., x, y, z, aa, ab, ac, ..., ax, ay, az, ba, bb, bc, ...,
bx, by, bz, ca, cb, da, db, ea, eb, ..., xa, xb, ya, yb, za, zb, aba,
abb, abc, abd, ..., abx, aby, abz, baa, bab, bac, ..., bax, bay, baz,
aab, acb, adb, ..., axb, ayb, azb, bba, bca, bda, ... bxa, bya, bza,
cba, dba, eba, xba, yba, zba

HTH, basically every possible string out of the given pool of up to 10
letters.

OK, writing these out just gave me an idea on how I could try to write
this another way.

OK, done. Wow, this program actually ran much, much slower than my
original: 135 seconds vs 9 seconds! I'm not sure if I made a huge
mistake or if it's because in the original I can stop searching
superset-candidates when the candidate is not a prefix of any dictionary
word (in the same way the mask would reject impossible mutations).

This does produce the correct mutations, however (although not unique, as I
wrote out above):

-- begin code --
def getmutations(pool):
    for i, letter in enumerate(pool):
        pool2 = pool[:]
        pool2.pop(i)

        for mutation in getmutations(pool2):
            if letter != '?':
                yield letter + mutation
            else:
                for c in alpha:
                    yield c + mutation

    yield ""

letters = list('abae?s?')
s = time.time()
output = open('words.log', 'w')
for candidate in getmutations(letters):
    # Check if word is valid
    result = mytrie.query(candidate)

    # Found a word, add it to the database
    if result == 1:
        words[candidate] = 1

print time.time() - s
--- end code ---
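
(To get rid of the duplicate candidates mentioned above, the generator could
be wrapped with a seen-set -- a small untested tweak:)

def unique_mutations(pool):
    seen = set()
    for mutation in getmutations(pool):
        # skip the empty string and anything we've already produced
        if mutation and mutation not in seen:
            seen.add(mutation)
            yield mutation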

Jul 18 '05 #6
Mike,

Thanks, this worked great, and was 4x faster than my method!

Thanks everyone for replying!
The changes I made were:

rest = ''.join([chr(i) for i in range(256) if chr(i).upper() not in WORD])
# a wildcard in the word means that we check all letters
if '?' in WORD:
    rest = ''
translator = string.maketrans('', '')

and

try:
    pool.remove(char)
except ValueError:
    # nested try to remove a possible wildcard
    try:
        pool.remove('?')
    except ValueError:
        return False

Jul 18 '05 #7
>>>>> "Case" == Case Nelson <ca*********@gmail.com> writes:
Hi there I've just been playing around with some python code and I've
got a fun little optimization problem I could use some help with. Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.
So if the letters are ['a','b','c'] check a, b, c, ab, ac, ba, bc, ca,
cb, abc, acb, bac, bca, cab, cba where only a, ba and cab would be
added to the dict of words. If the letters are ['?','?'] check a-z, aa,
ab, ac, ad, ..., az, ba, bb, bc, bd, ..., zz


You're trying to cheat at Scrabble, aren't you? ;-) I once wrote a solution
using regular expressions, which I brushed up and posted here. Can you
check how it compares with your version?

Call as "wordsearch abcd". You can also use wildcards.

=====
#!/usr/bin/python

import sys
import os
import re

wordlist = "/usr/share/dict/words"

def get_freq(s):
    """Create a hash of the number of times each character occurs in a string"""
    freq = {}
    for c in s:
        freq[c] = freq.get(c, 0) + 1
    return freq


def main(argv=sys.argv):
    chars = argv[1]
    # use the word list given on the command line, if any
    words_file = wordlist
    if len(argv) > 2:
        words_file = argv[2]
    wild = chars.count("?")
    if wild:
        chars = chars.replace("?", "")  # remove the wild cards
    if chars:
        charpat = "[" + chars + "]*"
    else:
        charpat = ""

    # create a regexp suitable for matching
    pattern = charpat
    for w in xrange(wild):
        pattern += ".?" + charpat
    pattern = "^" + pattern + "$"
    print >> sys.stderr, pattern

    # char count for our list of chars
    charfreq = get_freq(chars)

    pattern_re = re.compile(pattern, re.IGNORECASE)
    # for line in os.popen("zegrep -i '%s' %s" % (pattern, words_file)):
    for line in open(words_file):
        line = line.rstrip()
        if not pattern_re.search(line):
            continue
        if not line or len(line) > len(chars) + wild:
            continue
        line = line.lower()
        linefreq = get_freq(line)

        # check if chars occur fewer times than in the given string
        extra = 0
        for c in linefreq.keys():
            if linefreq[c] > charfreq.get(c, 0):
                extra += linefreq[c] - charfreq.get(c, 0)
            if extra > wild:
                break
        else:
            print line

if __name__ == '__main__':
    main()

=====
Jul 18 '05 #8
Hi Case.

Just in case you're really, truly looking for a fast Scrabble
word-search algorithm, the classic AI/CS article here is:

Andrew W. Appel and Guy J. Jacobson, The world's fastest Scrabble
program, Communications of the ACM, 31(5), pp 572--578, May 1988.

You can find a copy of this article at, for example:
http://counties.csc.calpoly.edu/~tea...le_program.pdf

This algorithm uses a "dawg" (Directed Acyclic Word Graph). You can
find a well-documented example of this data structure implemented in
Python in the WordUtils package:
http://cedar-solutions.com/software/wordutils/

Regards,

Win

Jul 18 '05 #9
Paul Rubin wrote:
"Case Nelson" <ca*********@gmail.com> writes:
Basically, the program needs to take in a random list of no more than
10 letters, and find all possible mutations that match a word in my
dictionary (80k words). However a wildcard letter '?' is also an
acceptable character which increases the worst case time significantly.

For that size pattern and dictionary, simply compiling the pattern to
a regexp, joining the dictionary together into one big string ("abc
def ghijk..."), and matching the regexp against the big string, may
well be faster than using some fancy algorithm coded completely in
python.


If these queries happen often, build a dictionary of sorted letters to
lists-of-words. The setup will be slow, but lookups in such a secondary
dictionary should be quite fast, with the ?s handled by relatively
straightforward iteration.
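
For instance (just a sketch; 'anagrams' is assumed to be a dict mapping
sorted-letter keys to lists of words, as discussed earlier in the thread):

from itertools import combinations
import string

def rack_words(rack, anagrams):
    # rack is e.g. 'abae?s?'; anagrams maps ''.join(sorted(word)) -> [words]
    found = set()
    for n in range(1, len(rack) + 1):
        for combo in combinations(rack, n):
            # Expand each '?' into all 26 letters, one candidate per choice.
            candidates = ['']
            for ch in combo:
                choices = string.ascii_lowercase if ch == '?' else ch
                candidates = [c + x for c in candidates for x in choices]
            for cand in candidates:
                found.update(anagrams.get(''.join(sorted(cand)), []))
    return sorted(found)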

--Scott David Daniels
Sc***********@Acm.Org
Jul 18 '05 #10
With a list of letters 'ABAE?S?', your implementation ran 3.5 times
faster than the one from http://blog.vrplumber.com/427 (0.437 seconds
vs 1.515 seconds).

Without wildcards yours runs slightly quicker as well.

I guess that, with wildcards, using a regex as a quick filter against each
word is much faster than the translate method.

Jul 18 '05 #11
