By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,846 Members | 1,836 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,846 IT Pros & Developers. It's quick & easy.

searching algorithm

P: n/a
Hi all!

I have text file (english-croatian dictionary) with words in it in alphabetical
order.
This file contains 179999 words in this format:
english word: croatian word

I want to make instant search for my gui
Instant search, i mean that my program search words and show words to user as
user type letters.
yes, it needs to be fast

Can someone give me some points (approaches) how to make this
Should i make indexes and go with file.seek

or should breake dictionary in peaces and put words that start with a in one and
with b in another...
?????

So if i have this words
absinth:pelin
absinthe:pelin
absolute:apsolutan
absolute:apsolutni kod
absolute:apsolutno
absolute:čist
absolute:nesumnjiv
absolute:potpun
absolute:savršen
absolute coordinates:apsolutne koordinate
absolute frequency:apsolutna učestalost
absolute gap:apsolutni jaz
absolute line spacing:apsolutni međurazmak linija
absolute majority:apsolutna većina
absolute pointing device:apsolutni pokazivački uređaj
absolute quantity:apsolutni udio
absolute value:apsolutna vrijednost
absolute zero:apsolutna nula
absolutely:apsolutno
absolutely:bezuvjetno
absolutely:nezavisno
absolutely:potpuno
absolutely:samostalno
absolutely:sasvim
absolution:odrješenje
absolution:oproštaj
absolutism:apsolutizam
absolve:odriješiti
absolve:osloboditi
absorb:absorbirati
absorb:apsorbirati
absorb:crpsti

if user type: "abs" program should list all words above in english and in croatian
if user type: "absorb" than program should list last 3 words in english and in
croatian

any help would be appreciate!
my apologies for bad english
May 10 '07 #1
Share this Question
Share on Google+
15 Replies


P: n/a
I have made a script that search anagram based on the ODS file ( call
OSW in english, Official Scrabble Words)
it load a file that contain 369085 words (one word per line)

I create a dictionnary and store word into using the length of the
word as key
example : mydict[2] contain a list of word with length = 2

first I select the correct dict entry and in a second time I scan this
list searching correct word

my file contains 369085 and it's pretty fast

Seb

May 10 '07 #2

P: n/a
On 2007-05-10, Gigs_ <gi**@hi.t-com.hrwrote:
Hi all!

I have text file (english-croatian dictionary) with words in it
in alphabetical order. This file contains 179999 words in this
format: english word: croatian word

I want to make instant search for my gui Instant search, i mean
that my program search words and show words to user as user
type letters. yes, it needs to be fast

Can someone give me some points (approaches) how to make this
Should i make indexes and go with file.seek

or should breake dictionary in peaces and put words that start
with a in one and with b in another...

So if i have this words
[abridged dictionary below]
absinth:pelin
absinthe:pelin
absolute:apsolutan
absolute:apsolutni kod
absolve:odrije?iti
absolve:osloboditi
absorb:absorbirati
absorb:apsorbirati
absorb:crpsti

if user type: "abs" program should list all words above in
english and in croatian if user type: "absorb" than program
should list last 3 words in english and in croatian
A solution that solves the problem with a data structure might be
a multi-tree.

Each node points at a set of following letters, and a set of
croatian translations, either of which might be empty, making the
node a leaf.

For the above (abrideged) dictionary, you would generate (use a
fixed-width "programmers" font so the tree looks good):

a
|
b
|
s
/ \
i o
/ / \
n l r
/ / \ \
t u v b->(absorbirati, crpisti)
/ | |
(pelin)<-h t e->(odrije?iti, osloboditi)
| |
(pelin)<-e e->(apsolutan, apsolutni kod)

As the user enter letters, you just march down the tree, printing
all the words held in leaf nodes held in the current node.

--
Neil Cerutti
We shall reach greater and greater platitudes of achievement. --Richard J.
Daley
May 10 '07 #3

P: n/a

"Neil Cerutti" <ho*****@yahoo.comwrote in message
news:sl********************@FIAD06.norwich.edu...
| On 2007-05-10, Gigs_ <gi**@hi.t-com.hrwrote:
| if user type: "abs" program should list all words above in
| english and in croatian if user type: "absorb" than program
| should list last 3 words in english and in croatian
|
| A solution that solves the problem with a data structure might be
| a multi-tree.

Specific computer science terms are prefix tree or trie (from reTRIEval).
http://en.wikipedia.org/wiki/Trie
gives an introduction.

| Each node points at a set of following letters, and a set of
| croatian translations, either of which might be empty, making the
| node a leaf.
|
| For the above (abrideged) dictionary, you would generate (use a
| fixed-width "programmers" font so the tree looks good):
|
| a
| |
| b
| |
| s
| / \
| i o
| / / \
| n l r
| / / \ \
| t u v b->(absorbirati, crpisti)
| / | |
| (pelin)<-h t e->(odrije?iti, osloboditi)
| | |
| (pelin)<-e e->(apsolutan, apsolutni kod)
|
| As the user enter letters, you just march down the tree, printing
| all the words held in leaf nodes held in the current node.

tjr
|

May 10 '07 #4

P: n/a
For the above (abrideged) dictionary, you would generate (use a
fixed-width "programmers" font so the tree looks good):

a
|
b
|
s
/ \
i o
/ / \
n l r
/ / \ \
t u v b->(absorbirati, crpisti)
/ | |
(pelin)<-h t e->(odrije?iti, osloboditi)
| |
(pelin)<-e e->(apsolutan, apsolutni kod)

As the user enter letters, you just march down the tree, printing
all the words held in leaf nodes held in the current node.
Call me dense, but how does one do this in Python - which doesn't have
pointers? Dictionaries with dictionaries within dictionaries... (with
each letter as the key and the its children as values) is going to be
extremely space inefficient, right?
May 10 '07 #5

P: n/a
Gordon Airporte wrote:
>
>For the above (abrideged) dictionary, you would generate (use a
fixed-width "programmers" font so the tree looks good):

a
|
b
|
s
/ \
i o
/ / \
n l r
/ / \ \
t u v b->(absorbirati, crpisti)
/ | |
(pelin)<-h t e->(odrije?iti, osloboditi)
| |
(pelin)<-e e->(apsolutan, apsolutni kod)

As the user enter letters, you just march down the tree, printing
all the words held in leaf nodes held in the current node.

Call me dense, but how does one do this in Python - which doesn't have
pointers? Dictionaries with dictionaries within dictionaries... (with
each letter as the key and the its children as values) is going to be
extremely space inefficient, right?
How would this be significantly more space inefficient than "pointers"?
The implementation of the dict would, in fact, itself be pointers, where
each key (same memory requirement if using pointers) is mapped to a
pointer to a dict node or a pointer to a leaf, same as with a more "low
level" construction. A printout of the graph as a dict might look a
little ugly, though.

I could be wrong, however.

James
May 10 '07 #6

P: n/a
On 2007-05-10, Gordon Airporte <JH*****@fbi.govwrote:
>
>For the above (abrideged) dictionary, you would generate (use a
fixed-width "programmers" font so the tree looks good):

a
|
b
|
s
/ \
i o
/ / \
n l r
/ / \ \
t u v b->(absorbirati, crpisti)
/ | |
(pelin)<-h t e->(odrije?iti, osloboditi)
| |
(pelin)<-e e->(apsolutan, apsolutni kod)

As the user enter letters, you just march down the tree, printing
all the words held in leaf nodes held in the current node.

Call me dense, but how does one do this in Python - which
doesn't have pointers? Dictionaries with dictionaries within
dictionaries... (with each letter as the key and the its
children as values) is going to be extremely space inefficient,
right?
Unfortunately, I don't know the space tradeoffs in Python
offhand. Lists and tuples make excellent trees.

The above might be stored as follows:

Every node is a tuple of its letter, a list of its children, and
a list of its words. So the two 'pelin' nodes would be (with 'e'
referenced in the 'h' node):

('h', [('e', [], ['pelin'])], ['pelin'])

That would in turn be "stored" in the t, n, i and s nodes.

('s',
[('i',
[('n',
[('t',
[('h', [('e', [], ['pelin'])], ['pelin'])
[])]
[])]
[]), ('o' trie (thanks Terry) omitted for my sanity)])

It's a lot harder to write by hand than it would be to use.

My intuition says it shouldn't be terribly hard on resources for
for a 180K dictionary, but I could be wrong. I'm too lazy to
measure. ;)

If it does turn out to be unreasonably huge, then you'd fall back
on solving the problem completely with a binary search returning
a range (I'm not sure of the name), which would be more expensive
at run time, but might be fast enough, and would use a minimal
amount of 'resources'.

--
Neil Cerutti
May 10 '07 #7

P: n/a

"Neil Cerutti" <ho*****@yahoo.comwrote in message
news:lN*******************@newsreading01.news.tds. net...
| Every node is a tuple of its letter, a list of its children, and
| a list of its words. So the two 'pelin' nodes would be (with 'e'
| referenced in the 'h' node):
|
| ('h', [('e', [], ['pelin'])], ['pelin'])
|
| That would in turn be "stored" in the t, n, i and s nodes.
[snip]

At the outer level, I would use a list in order to build the structure in
pieces, one for each letter, and then add them in.

At the top level, the letters do not need explicit storage. The first
subtree is for words starting with 'a', etc. In other words, the position
of each subtree indicates its starting letter. For most letters, this can
be carried on another level.

tjr

May 11 '07 #8

P: n/a
On May 11, 3:12 am, Neil Cerutti <horp...@yahoo.comwrote:
On 2007-05-10, Gordon Airporte <JHoo...@fbi.govwrote:


For the above (abrideged) dictionary, you would generate (use a
fixed-width "programmers" font so the tree looks good):
a
|
b
|
s
/ \
i o
/ / \
n l r
/ / \ \
t u v b->(absorbirati, crpisti)
/ | |
(pelin)<-h t e->(odrije?iti, osloboditi)
| |
(pelin)<-e e->(apsolutan, apsolutni kod)
As the user enter letters, you just march down the tree, printing
all the words held in leaf nodes held in the current node.
Call me dense, but how does one do this in Python - which
doesn't have pointers? Dictionaries with dictionaries within
dictionaries... (with each letter as the key and the its
children as values) is going to be extremely space inefficient,
right?

Unfortunately, I don't know the space tradeoffs in Python
offhand. Lists and tuples make excellent trees.

The above might be stored as follows:

Every node is a tuple of its letter, a list of its children, and
a list of its words. So the two 'pelin' nodes would be (with 'e'
referenced in the 'h' node):

('h', [('e', [], ['pelin'])], ['pelin'])

That would in turn be "stored" in the t, n, i and s nodes.

('s',
[('i',
[('n',
[('t',
[('h', [('e', [], ['pelin'])], ['pelin'])
[])]
[])]
[]), ('o' trie (thanks Terry) omitted for my sanity)])

It's a lot harder to write by hand than it would be to use.

My intuition says it shouldn't be terribly hard on resources for
for a 180K dictionary, but I could be wrong. I'm too lazy to
measure. ;)

If it does turn out to be unreasonably huge, then you'd fall back
on solving the problem completely with a binary search returning
a range (I'm not sure of the name), which would be more expensive
at run time, but might be fast enough, and would use a minimal
amount of 'resources'.

--
Neil Cerutti
The problem that I see here is that each time user types a letter, all
the leaf nodes in the subtree would have to be traversed again. This
computation was already done for all the letters user typed before the
last one.

My suggestion would be to have two data structures. The first one
similar to the tree mentioned above, but leaf nodes need not have
Croatian translations, and each node should also have total number of
leafs in both left siblings and right siblings of that node. The
second data structure would be list of English:Croatian word pairs.

As the user types a new letter, we just have to remove(not show), the
number of leaf elements in the left and/or right siblings, from left
and/or right of the second data structure. So we just have to keep
track of the node we r currently at(in the tree) and the start, end
index for the second data structure(basically the list to be shown to
the user).

By the way, both data structures could be implemented as tuple in
python, for I suppose, if only lookup is needed tuple gives better
performance than list.

ciju

May 11 '07 #9

P: n/a

On May 10, 2007, at 12:26 PM, Gigs_ wrote:
Hi all!

I have text file (english-croatian dictionary) with words in it in
alphabetical
order.
This file contains 179999 words in this format:
english word: croatian word

I want to make instant search for my gui
Instant search, i mean that my program search words and show words
to user as
user type letters.
yes, it needs to be fast

Can someone give me some points (approaches) how to make this
Should i make indexes and go with file.seek

or should breake dictionary in peaces and put words that start with
a in one and
with b in another...
?????

So if i have this words
...

if user type: "abs" program should list all words above in english
and in croatian
if user type: "absorb" than program should list last 3 words in
english and in
croatian

any help would be appreciate!
my apologies for bad english
Here's an idea: use a rats' nest of dictionaries and do all the
lookup work up front when you build the rats' nest. Maybe something
like this:

#! /usr/bin/env python
import pprint

dictionary = """absinth:pelin
absinthe:pelin
absolute:apsolutan
absolute:apsolutni kod
absolute:apsolutno
absolute:čist
absolute:nesumnjiv
absolute:potpun
absolute:savrsen
absolute coordinates:apsolutne koordinate
absolute frequency:apsolutna učestalost
absolute gap:apsolutni jaz
absolute line spacing:apsolutni međurazmak linija
absolute majority:apsolutna većina
absolute pointing device:apsolutni pokazivački uređaj
absolute quantity:apsolutni udio
absolute value:apsolutna vrijednost
absolute zero:apsolutna nula
absolutely:apsolutno
absolutely:bezuvjetno
absolutely:nezavisno
absolutely:potpuno
absolutely:samostalno
absolutely:sasvim
absolution:odrjesenje
absolution:oprostaj
absolutism:apsolutizam
absolve:odrijesiti
absolve:osloboditi
absorb:absorbirati
absorb:apsorbirati
absorb:crpsti"""

lookup = {'words':{}, 'letters':{}}

for translation in dictionary.split('\n'):
english, croatian = translation.split(':')
if english in lookup['words']:
lookup['words'][english].append(croatian)
else:
lookup['words'][english] = [croatian]

for position, letter in enumerate(english):
if position == 0:
youAreHere = lookup['letters']

if letter not in youAreHere:
youAreHere[letter] = {'words':[]}
youAreHere[letter]['words'].append(lookup['words'][english])
youAreHere = youAreHere[letter]

def tryit(partial):
youAreHere = lookup['letters']
for letter in partial:
youAreHere = youAreHere[letter]
return youAreHere['words']

if __name__ == '__main__':
pprint.pprint(tryit('abso'))
Hope this helps,
Michael
---
The Rules of Optimization are simple.
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- Michael A. Jackson , "Principles of
Program Design", 1975.
May 11 '07 #10

P: n/a
>
Call me dense, but how does one do this in Python - which doesn't have
pointers? Dictionaries with dictionaries within dictionaries... (with
each letter as the key and the its children as values) is going to be
extremely space inefficient, right?
Isn't *everything* in python essentially a pointer? Dictionaries
with dictionaries within dictionaries... My gut feeling (which means
I have not measured it, so I don't actually know) is that it would
not be space inefficient. Perhaps someone who knows more about this
will speak up?

May 11 '07 #11

P: n/a
On 2007-05-11, ciju <ma***************@gmail.comwrote:
By the way, both data structures could be implemented as tuple
in python, for I suppose, if only lookup is needed tuple gives
better performance than list.
I used a list instead of a tuple where I thought a list would be
convenient while building the data structure. But you could
convert everything to tuples in the end, it's true.

--
Neil Cerutti
May 11 '07 #12

P: n/a
Michael Bentley <mi*****@jedimindworks.comwrote:

Call me dense, but how does one do this in Python - which doesn't have
pointers? Dictionaries with dictionaries within dictionaries... (with
each letter as the key and the its children as values) is going to be
extremely space inefficient, right?

Isn't *everything* in python essentially a pointer? Dictionaries
with dictionaries within dictionaries... My gut feeling (which means
I have not measured it, so I don't actually know) is that it would
not be space inefficient. Perhaps someone who knows more about this
will speak up?
Dicts are hash tables, and therefore, for performance, always keep some
"extra" space (so that the table won't be too full).
Alex
May 11 '07 #13

P: n/a

On May 11, 2007, at 3:50 AM, Michael Bentley wrote:
>
Here's an idea: use a rats' nest of dictionaries and do all the
lookup work up front when you build the rats' nest. Maybe something
like this:
....

Oops! This is better :-)

#! /usr/bin/env python
import pprint

dictionary = """absinth:pelin
absinthe:pelin
absolute:apsolutan
absolute:apsolutni kod
absolute:apsolutno
absolute:čist
absolute:nesumnjiv
absolute:potpun
absolute:savrsen
absolute coordinates:apsolutne koordinate
absolute frequency:apsolutna učestalost
absolute gap:apsolutni jaz
absolute line spacing:apsolutni međurazmak linija
absolute majority:apsolutna većina
absolute pointing device:apsolutni pokazivački uređaj
absolute quantity:apsolutni udio
absolute value:apsolutna vrijednost
absolute zero:apsolutna nula
absolutely:apsolutno
absolutely:bezuvjetno
absolutely:nezavisno
absolutely:potpuno
absolutely:samostalno
absolutely:sasvim
absolution:odrjesenje
absolution:oprostaj
absolutism:apsolutizam
absolve:odrijesiti
absolve:osloboditi
absorb:absorbirati
absorb:apsorbirati
absorb:crpsti"""

lookup = {'words':{}, 'letters':{}}

for translation in dictionary.split('\n'):
english, croatian = translation.split(':')
if english in lookup['words']:
lookup['words'][english].append(croatian)
else:
lookup['words'][english] = [english, croatian]

for position, letter in enumerate(english):
if position == 0:
youAreHere = lookup['letters']

if letter not in youAreHere:
youAreHere[letter] = {'words':[]}

if lookup['words'][english] not in youAreHere[letter]['words']:
youAreHere[letter]['words'].append(lookup['words'][english])
youAreHere = youAreHere[letter]

def tryit(partial):
youAreHere = lookup['letters']
for letter in partial:
youAreHere = youAreHere[letter]
return youAreHere['words']

if __name__ == '__main__':
pprint.pprint(tryit('abs'))
print '=' * 50
pprint.pprint(tryit('absorb'))

May 12 '07 #14

P: n/a
On May 10, 1:26 pm, Gigs_ <g...@hi.t-com.hrwrote:
Hi all!

I have text file (english-croatian dictionary) with words in it in alphabetical
order.
This file contains 179999 words in this format:
english word: croatian word
Let's assume it's okay to have all the data in memory.
In my experience the very fastest way to do what you
want is to store the strings in a sorted list and use
the binary search library module bisect. I once compared this
with doing something similar with tries and it was
much faster. It's also the most simple way to do it, which
is nice too :).
-- Aaron Watters

===
never eat anything bigger than your head -- kliban
May 12 '07 #15

P: n/a
On 2007-05-11, Terry Reedy <tj*****@udel.eduwrote:
>
"Neil Cerutti" <ho*****@yahoo.comwrote in message
news:lN*******************@newsreading01.news.tds. net...
| Every node is a tuple of its letter, a list of its children, and
| a list of its words. So the two 'pelin' nodes would be (with 'e'
| referenced in the 'h' node):
|
| ('h', [('e', [], ['pelin'])], ['pelin'])
|
| That would in turn be "stored" in the t, n, i and s nodes.
[snip]

At the outer level, I would use a list in order to build the
structure in pieces, one for each letter, and then add them in.

At the top level, the letters do not need explicit storage.
The first subtree is for words starting with 'a', etc. In
other words, the position of each subtree indicates its
starting letter. For most letters, this can be carried on
another level.
Here's a proof of concept prototype of the dictionary searching
algorithm. The dictionary I found to test with has roughly 5,000
words, and I didn't attempt any optimizations at all. I'm
guessing it can't be sped up very much, even though I wrote the
simplest possible implementation. Lookup in the worst case is
O(N), where N is the number of letters in your prefix.

First, a sample of input/output.

lookup: pe
people: la gente[Noun]
pepper: pepe[Noun]
peppercorn: grano di pepe[Noun]
peppermint: menta peperita[Noun]
peppery: pepato, acre, pungente[Adjective]
pepsin: pepsina[Noun]
permutation: permutazione[Noun]
permutations: permutazioni[Noun]
permute: permutare[Verb]
perorate: perorare
perpendicular: perpendicolare[Adjective]
perpendicularly: perpendicolarmente[Adverb]
perpendiculars: perpendicolari[Noun]
perpetrate: perpetrare[Verb]
perpetrated: perpetrato[Verb]
perpetual: perpetuo[Adjective]
petard: petardo[Noun]
petroleum: petrolio[Noun]

Now the source.

# Each node is a tuple of (letter, node list, word list). A word
# list is just a list of strings.
root = ('', [], [])

def insert(node, key, value):
if len(key) == 0:
node[2].append(value)
else:
first = key[0]
rest = key[1:]
for n in node[1]:
if n[0] == first:
insert(n, rest, value)
return
node[1].append((first, [], []))
insert(node, key, value)

def lookup(node, key):
if len(key) == 0:
return node
else:
first = key[0]
rest = key[1:]
for v in node[1]:
if v[0] == first:
return lookup(v, rest)
return None

def word_list(node, word):
for v in node[2]:
print '%s: %s' % (word, v)
for n in node[1]:
word_list(n, word+n[0])
# Italian.txt from http://www.june29.com/IDP/
source = open('Italian.txt', 'r')

# Build tree
for line in source:
line = line.strip()
if line[0] == '#': continue
key, value = line.split('\t')
insert(root, key, value)

# Look up a prefix
x = raw_input("lookup: ")

tree = lookup(root, x)
if tree:
word_list(tree, x)
else:
print "Not found."

--
Neil Cerutti
May 17 '07 #16

This discussion thread is closed

Replies have been disabled for this discussion.