
# Best algorithm/data structure combination for word lookup?

Hi,

I'm currently writing a program that performs transliteration (i.e.,
converts Greek text written using the English alphabet to "pure" Greek
text using the Greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:

1. Can support an N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)

2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.

3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.

Things that I don't mind trading off:

1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.

My current thoughts are:

* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).

So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std::pair<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words

A sorted std::vector would be faster for lookups than std::map, right?
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?

What do you suggest? Is there any other good option that I'm not
considering?

Jon

Jul 27 '07 #1
pj@moodle.org wrote:
[..]
A sorted std::vector would be faster for lookups than std::map, right?
If so, then only marginally.
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?
Could be. Have you tried looking at different approaches in
TAOCP by Knuth? Sorting and Searching takes a whole volume.
What do you suggest? Is there any other good option that I'm not
considering?
Pick one, get it working. Think of an interface where you can
easily plug in different implementations. Then measure each of
the approaches.

V
--
Jul 27 '07 #2
On 2007-07-27 17:59, pj@moodle.org wrote:
Hi,

I'm currently writing a program that performs transliteration (i.e.,
converts Greek text written using the English alphabet to "pure" Greek
text using the Greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:

1. Can support an N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)

2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.
When using binary trees/searches, the time it takes to find out that a
word is not in the dictionary is about the same as the time it takes to
find that it is; if you think about it you'll see why.
3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.
A question: is there much work involved in translating letter for
letter, i.e. isn't there more or less a 1:1 mapping of the Latin
letters to the Greek? Just curious, since you probably won't see any
gains if it's that simple.
Things that I don't mind trading off:

1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.

My current thoughts are:

* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
I'd probably use a pointer instead of an int; I doubt there'll be
much difference in speed.
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).

So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std::pair<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words

A sorted std::vector would be faster for lookups than std::map, right?
For small sets the vector could probably be faster, but I think the map
will be better as the size increases (though I don't have anything to
back that up with, so measure yourself).
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?
Unless you plan to use the application in an embedded system or on a
handheld, std::map/std::vector will probably be fast enough for a
dictionary of that size.
What do you suggest? Is there any other good option that I 'm not
considering?
I don't think it will gain you much with such small sets of words, but
a radix tree might be a good choice. It's a tree structure, and the
depth of each node would be no more than the number of letters in the
word (it might be less). Another advantage is that it can give you
early negatives: you'd only have to walk as deep as the word has
letters in common with a word that's in the dictionary. The Wikipedia
article on the subject will give you the idea:
en.wikipedia.org/wiki/Radix_tree
--
Erik Wikström
Jul 27 '07 #3
On Jul 27, 12:59 pm, Erik Wikström <Erik-wikst...@telia.com> wrote:
On 2007-07-27 17:59, p...@moodle.org wrote:
[snip]
A sorted std::vector would be faster for lookups than std::map,
right?

For small sets the vector could probably be faster but I think the
map will be better as the size increase (but I don't have anything
to back that up with, so measure yourself).
I think you misunderstand. The OP was referring to a sorted vector
that one searches using std::lower_bound or a similar binary search
algorithm. Doing so is typically more efficient than using the find
member function of a std::map, which is a more complex data structure
than a vector. The drawback of using a sorted vector is that
inserting or deleting elements is typically far more time consuming
than doing so in a map. The advantage is that searching to see if an
element exists is typically faster in a sorted vector than in a map.
(Both searches should occur in logarithmic time, but the binary search
of a sorted vector may be done more efficiently.) This should hold
true whether the size of the container is 100 or 10,000, although the
gains may be marginal depending upon the implementation. Scott Meyers
devotes a chapter of Effective STL to this topic.

Best regards,

Tom

Jul 27 '07 #4
For faster look-up you might create a pointer-table at the beginning
of the run time or even at compile time if the word list is fixed.

For the English alphabet this would be an array of 26 indices. The
first entry is 0; that's where words with initial letter 'a' start in
your word list. The second entry is the index of the first word with
initial letter 'b' in your word list, and so on.

If you have to look up a word with initial letter 'b', then you can
start at pointer-table entry 1 and stop at pointer-table entry 2
(exclusive). If you don't find it in this shorter range then it does
not exist in the full list.

With a bit more memory you can also go for the first two letters. That
would make an index list with 26**2 = 676 entries. This should shorten
the look-up time significantly.

Best regards,
Tobias

Jul 28 '07 #5
On Jul 27, 5:59 pm, "p...@moodle.org" <John.Papaioan...@gmail.com>
wrote:
I'm currently writing a program that performs transliteration (i.e.,
converts Greek text written using the English alphabet to "pure" Greek
text using the Greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:
1. Can support an N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)
The alternative to storing the string "leet" twice is to use
pointers. And it's far from obvious that a pointer requires
less memory than the string: on my machine, a pointer is 8
bytes, whereas I can store "leet" in five bytes.
2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.
For 1000 or so words, std::set is more than adequate. For
100000, of course, you'd have to consider some sort of hashing.

You might also consider some sort of trie.
3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.
Things that I don't mind trading off:
1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.
My current thoughts are:
* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
Which would make it significantly slower, without necessarily
reducing memory very much.
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).
If the data is immutable, and can be known at compile time, the
simplest solution is probably std::lower_bound over a C style
array, statically initialized; if you can set a maximum length
of, say 7, for the strings, this can avoid all pointers
altogether, and be very, very fast. (It also has very good
locality, which can make a significant difference in real life,
even if it doesn't affect the big-O factor.)

Otherwise, std::map (normally a balanced tree, and never a hash
table) can be used.
So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std::pair<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words
A sorted std::vector would be faster for lookups than std::map, right?
What do the measurements say on your machine? As I said, if the
data were known at compile time, I'd use a:

struct Entry
{
    char from[ 8 ] ;
    char to  [ 8 ] ;

    bool operator<( Entry const& other ) const
    {
        return memcmp( from, other.from, 8 ) < 0 ;
    }
} ;

Entry const table[] =
{
    // Entries pre-sorted.
} ;

and std::lower_bound. A bit brutal, perhaps, but remarkably
efficient, because locality is as good as possible.
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?
It depends on the efficiency of the hash function. There can be
some gain for as few as a couple of hundred elements, but in
practice the difference is very, very small, and a hash table
will never have as good locality as something like the above.

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 29 '07 #6
On Jul 28, 12:15 am, Thomas Tutone <Thomas8675...@yahoo.com> wrote:
On Jul 27, 12:59 pm, Erik Wikström <Erik-wikst...@telia.com> wrote:
On 2007-07-27 17:59, p...@moodle.org wrote:
[snip]
A sorted std::vector would be faster for lookups than std::map,
right?
For small sets the vector could probably be faster, but I think
the map will be better as the size increases (though I don't
have anything to back that up with, so measure yourself).
I think you misunderstand. The OP was referring to a sorted vector
that one searches using std::lower_bound or a similar binary search
algorithm. Doing so is typically more efficient than using the find
member function of a std::map, which is a more complex data structure
than a vector.
I suspect that the real difference comes from the fact that
vector has better locality; std::map is (always?) a node based
structure, which means each entry is allocated separately, and
could be anywhere with regards to the others. And even if it
happens to be adjacent, you still have the extra pointers, etc.,
which means that you get fewer entries in a cache line or a
page.

Note that similar considerations may concern the use of
std::string, unless the implementation uses the small string
optimization.

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 29 '07 #7

This file contains 179999 words in this format: english word: croatian word I want to make... 0 by: Charles Arthur | last post by: How do i turn on java script on a villaon, callus and itel keypad mobile phone 0 by: ryjfgjl | last post by: If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming... 0 by: emmanuelkatto | last post by: Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel 1 by: nemocccc | last post by: hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions? 1 by: Sonnysonu | last post by: This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to... 0 by: Hystou | last post by: There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2... 0 by: marktang | last post by: ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,... 0 by: Hystou | last post by: Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can... 0 by: tracyyun | last post by: Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...