473,387 Members | 1,834 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Ternary Search Trees

In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)
Nov 17 '05 #1
7 1924
Hi,
i am assuming you have libraries to do string comparison. say you are in
node "root" and it has 3 children child1, child2, child3. if the word that
you are searching is less than child1(Which you can find using strcmp or any
other string utility functions) then go down in child1. if it is between
child1 and child2(greater than child1 and lesser than child2) then go down in
child2 else go down in child3. if this answered your question well and good!
if not get back with more details of what you are trying to do.
regards,
Lingesh

"Bonj" wrote:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)

This posting is provided "AS IS" with no warranties, and confers no rights

Nov 17 '05 #2
I'm trying to build a word-classifying algorithm for a syntax highlighter for
T-SQL.
i am assuming you have libraries to do string comparison.
No, that's what I don't have. The strings will be std::basic_string<_TCHAR>
but that's the only library I am using, but I want the algorithm to be as
fast as possible.
say you are in
node "root" and it has 3 children child1, child2, child3. if the word that
you are searching is less than child1(Which you can find using strcmp or any
other string utility functions) then go down in child1. if it is between
child1 and child2(greater than child1 and lesser than child2) then go down in
child2 else go down in child3. if this answered your question well and good!
if not get back with more details of what you are trying to do.
What I don't get, is what form would "root" take, what form would "child1"
take, and especially what form would "is less than" take, in this
*particular* scenario? I'm hitting brick walls because the only examples I
can find are text-book generic.
For instance, if I was searching for the type of word for sp_help, the
parameter I pass in is a string with the value "sp_help" and I want to get
back the value WT_SYSTEMPROC (where WT_SYSTEMPROC is an application-defined
constant).
Is the "root" node the first letter, "s"? Is the root node instead the very
first (alphabetically) word out of the ones I want to recognise? Can you
describe the process of traversing down the tree for this example of this
specific algorithm?

regards,
Lingesh

"Bonj" wrote:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)

This posting is provided "AS IS" with no warranties, and confers no rights

Nov 17 '05 #3
sorry another one, what *form* would "go down" take?
i.e. go from "sp_he" to "sp_hel", one step nearer to the child node of
"sp_help"?
(am I on the right lines?)
What about "sp_helpindex", that would be a child node of "sp_help", but
"sp_help" is a child node in itself. This is one thing that confuses me.
"Lingeshwaran Palaniappan(Microsoft)" wrote:
Hi,
i am assuming you have libraries to do string comparison. say you are in
node "root" and it has 3 children child1, child2, child3. if the word that
you are searching is less than child1(Which you can find using strcmp or any
other string utility functions) then go down in child1. if it is between
child1 and child2(greater than child1 and lesser than child2) then go down in
child2 else go down in child3. if this answered your question well and good!
if not get back with more details of what you are trying to do.
regards,
Lingesh

"Bonj" wrote:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)

This posting is provided "AS IS" with no warranties, and confers no rights

Nov 17 '05 #4
Bonj wrote:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?
First, I would try std::map to see if you really need to go through all
the work of writing and debugging this new data structure. Also,
boost::spirit has a symbol table class that is supposedly based on a
TST, which you might look into using. (Actually, if you're doing
parsing, you should probably look into boost::spirit. It's incredibly
nifty.)
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?
Only the first character. Otherwise you could just have a binary search
tree.
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)


A TST will basically be a binary search tree on the first character of
the string and when a match is found, the node will contain another TST
that starts with the second character, and so on until you run out of
string or something doesn't match.

So you have a structure something like:

struct tst_node
{
_TCHAR match;
struct tst_node *left, *right;
union
{
struct tst_node *ptr; /* when match != 0 */
unsigned data; /* only used when match == 0 */
} down;
};

struct tst_node *root;

And for your three examples, "sp_help", "select", and "sysobjects", the
top of the TST might look like:

root -> node0

/* first level BST */

node0 =
{
match = 's'
left = right = NULL
ptr -> node1
}

/* second level BST (advance string iterator/pointer/index) */

node1 =
{
match = 'p'
left = node2 /* less than 'p' */
right = node3 /* greater than 'p' */
ptr -> subtree with only "_help"
}

node2 =
{
match = 'e'
left = right = NULL
ptr -> subtree with only "lect"
}

node3 =
{
match = 'y'
left = right = NULL
ptr -> subtree with only "sobjects"
}

Now the subtree with only "sobjects" is going to take at least 16*9 =
144 bytes, which is quite wasteful. If you're sneaky, you can store a
sequence of characters in each node rather than a single character. You
still treat it as having only one character for the purpose of the
binary search, but if that matches, you compare the rest of the
characters in that node and give up if any do not match. But this is
just an optimization, and it makes things a lot more complex. (of
course you are looking at implementing a TST, so maybe this is important
to you)

I hope this is clear enough to be of some use. :)

-josh

Nov 17 '05 #5
Yes, that's great, thanks.
I'll study your answer, but it seems to englighten the situation.
Now the subtree with only "sobjects" is going to take at least 16*9 =
144 bytes, which is quite wasteful. If you're sneaky, you can store a
sequence of characters in each node rather than a single character. You
still treat it as having only one character for the purpose of the
binary search, but if that matches, you compare the rest of the
characters in that node and give up if any do not match. But this is
just an optimization, and it makes things a lot more complex. (of
course you are looking at implementing a TST, so maybe this is important
to you)


That's very good to know, cheers.
Nov 17 '05 #6
Just another question:
One last thing I can't quite get my head round and I'm hoping you'll be able
to help:
What would happen algorithmically in the situation, where a node has got
child nodes, but is also a root node itself? For instance, "sp_help" is a
child node, but also has child nodes relating to, say, "sp_helpindex"
???

Any thoughts, like what would the code basically do?

"josh" wrote:
Bonj wrote:
In making a ternary search tree to identify as fast as possible the type of
word passed in to the algorithm, for instance sp_help -> 1 (procedures),
select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on
deciding what type of comparison happens at each node.
Has anyone got any suggestions?


First, I would try std::map to see if you really need to go through all
the work of writing and debugging this new data structure. Also,
boost::spirit has a symbol table class that is supposedly based on a
TST, which you might look into using. (Actually, if you're doing
parsing, you should probably look into boost::spirit. It's incredibly
nifty.)
Do you think for the first level I could compare just the first characters
of the word passed in and the word at the 'current' node, or would it be more
likely to be a whole word comparison for each node?


Only the first character. Otherwise you could just have a binary search
tree.
If it would be a whole word comparison, what form would this take for best
performance?
Bearing in mind that each word will only ever return the same type, but more
than one word will have a certain type (i.e. the type will not be unique, but
the word will.)


A TST will basically be a binary search tree on the first character of
the string and when a match is found, the node will contain another TST
that starts with the second character, and so on until you run out of
string or something doesn't match.

So you have a structure something like:

struct tst_node
{
_TCHAR match;
struct tst_node *left, *right;
union
{
struct tst_node *ptr; /* when match != 0 */
unsigned data; /* only used when match == 0 */
} down;
};

struct tst_node *root;

And for your three examples, "sp_help", "select", and "sysobjects", the
top of the TST might look like:

root -> node0

/* first level BST */

node0 =
{
match = 's'
left = right = NULL
ptr -> node1
}

/* second level BST (advance string iterator/pointer/index) */

node1 =
{
match = 'p'
left = node2 /* less than 'p' */
right = node3 /* greater than 'p' */
ptr -> subtree with only "_help"
}

node2 =
{
match = 'e'
left = right = NULL
ptr -> subtree with only "lect"
}

node3 =
{
match = 'y'
left = right = NULL
ptr -> subtree with only "sobjects"
}

Now the subtree with only "sobjects" is going to take at least 16*9 =
144 bytes, which is quite wasteful. If you're sneaky, you can store a
sequence of characters in each node rather than a single character. You
still treat it as having only one character for the purpose of the
binary search, but if that matches, you compare the rest of the
characters in that node and give up if any do not match. But this is
just an optimization, and it makes things a lot more complex. (of
course you are looking at implementing a TST, so maybe this is important
to you)

I hope this is clear enough to be of some use. :)

-josh

Nov 17 '05 #7
Bonj wrote:
Just another question:
One last thing I can't quite get my head round and I'm hoping you'll be able
to help:
What would happen algorithmically in the situation, where a node has got
child nodes, but is also a root node itself? For instance, "sp_help" is a
child node, but also has child nodes relating to, say, "sp_helpindex"
???

Any thoughts, like what would the code basically do?


The node at "sp_help" will have an entry with its matching character
equal to 0 and an entry with its maching character equal to 'i'. The
first holdes the data for "sp_help" and the second goes on to match
"sp_helpindex".

If your strings are nul terminated, you are guaranteed that a node that
corresponds to a final symbol will not have children.

-josh

Nov 17 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
by: exonic | last post by:
I'm reposting this, as everyone went off on the "did this say c++" tangent, when someone thought he was reading a different group. Hope I get better results this time. I need an efficient...
3
by: ptrSriram | last post by:
Can someone help me with an algorithm to merge two binary search trees. One method I thought of was to flatten both the trees into sorted lists(inorder traversal),merge those two sorted lists,...
1
by: maryjones11289 | last post by:
Hi All, I'm trying to write/find code that creates a Ternary Search Tree in Visual Basic (VB6 or .NET). Here's my situation: What I have is an array consisting of 60,000 string elements. All...
0
by: rasmus ekman | last post by:
Hi, I've refreshed my Ternary search tree implementation at http://abc.se/~re/code/tst/ It should now be usable. It compiles cleanly with Comeau's online compiler, g++ 3.4.3, and MSVC 7.1....
10
by: free2cric | last post by:
Hi, I have a single link list which is sorted. structure of which is like typedef struct mylist { int num; struct mylist *next;
4
by: Mountain | last post by:
Background: In trees, there are a two or three ways (maybe more) to find out if Node A is an ancestor of Node B without traversing the tree. Preprossing is OK, and one of the solutions I am...
4
by: boris | last post by:
I have two directories, lib1 and lib2, that both contain the package foo, one with the submodule mod1 and the other with the submodule mod2: $ ls lib*/foo/ lib1/foo/: __init__.py __init__.pyc...
5
by: PerlPhi | last post by:
hi,,, while ago i was wondering why do some programmers rarely uses the ternary operator. wherein it is less typing indeed. i believe in the classic virtue of Perl which is laziness. well let me show...
1
by: mark | last post by:
I have a 5x5 Matrix. Each cell consists a string value(ex:URL of a website) I want to find out which string value occurs( i.e. repeats ) maximum no. of times in the matrix. I have a hint...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.