473,320 Members | 2,193 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

B-Tree advice needed

Hello
I am making a syntax highlighter for T-SQL, and I am going to hardcode the
words into it for speed's sake (I will probably have thought up enough new
features for a new version when the next version of SQL comes out).
My main algorithm is going to go through all the words in the RICHEDIT
control, and pass each valid word it finds to a word-checking function, which
I want to operate as fast as possible as it will be called frequently. For
these purposes, I was going to try and have a b-tree based algorithm. My
thinking is that if, say, the words it has to find are (hypothetically) get,
set, and select, then it could be of the form
int getwordtype(word)
{
if((word[0] == 'g') && (word[1] == 'e') && (word[2] == 't'))
return WORDTYPE_GET;
if((word[0] == 's') && (word[1] == 'e'))
{
if((word[2] == 't')) return WORDTYPE_SET;
if((word[3] == 'l') && (word[4] == 'e') && (word[5] == 'c') &&
(word[6] == 't')) return WORDTYPE_SELECT;
}
return WORDTYPE_NONE;
}

would this the fastest method? I do want to implement a b-tree.

Obviously I am not going to type out the whole algorithm with all the words
in SQL, I am going to write a program in C# to generate the .cpp file which
will hold it.

Any tips?

Thanks!

Nov 17 '05 #1
11 1599
Bonj wrote:
Hello
I am making a syntax highlighter for T-SQL, and I am going to
hardcode the words into it for speed's sake (I will probably have
thought up enough new features for a new version when the next
version of SQL comes out).
My main algorithm is going to go through all the words in the RICHEDIT
control, and pass each valid word it finds to a word-checking
function, which I want to operate as fast as possible as it will be
called frequently. For these purposes, I was going to try and have a
b-tree based algorithm. My thinking is that if, say, the words it has
to find are (hypothetically) get, set, and select, then it could be
of the form
int getwordtype(word)
{
if((word[0] == 'g') && (word[1] == 'e') && (word[2] == 't'))
return WORDTYPE_GET;
if((word[0] == 's') && (word[1] == 'e'))
{
if((word[2] == 't')) return WORDTYPE_SET;
if((word[3] == 'l') && (word[4] == 'e') && (word[5] == 'c') &&
(word[6] == 't')) return WORDTYPE_SELECT;
}
return WORDTYPE_NONE;
}

would this the fastest method? I do want to implement a b-tree.

Obviously I am not going to type out the whole algorithm with all the
words in SQL, I am going to write a program in C# to generate the
.cpp file which will hold it.


What you want is a TST - Ternary Search Tree - not a b-tree.

Do some Googling and you'll find numerous references. Here's one that
describes TST algorithms in Java.

http://www.javaworld.com/javaworld/j...6-ternary.html

As for generating inline C++ code to implement the algorithm, you might want
to write it straightforwardly firsrt and time it. It's possible that the
all-inline version won't be faster, and could even be slower due to the
greatly increased code size and rate of branching.

-cd
Nov 17 '05 #2
The conventional structure used for symbol table lookup, is a hash table.
You could use std::hash_map straight out of the box.

Keith MacDonald

"Bonj" <Bo**@discussions.microsoft.com> wrote in message
news:9C**********************************@microsof t.com...
Hello
I am making a syntax highlighter for T-SQL, and I am going to hardcode the
words into it for speed's sake (I will probably have thought up enough new
features for a new version when the next version of SQL comes out).
My main algorithm is going to go through all the words in the RICHEDIT
control, and pass each valid word it finds to a word-checking function,
which
I want to operate as fast as possible as it will be called frequently. For
these purposes, I was going to try and have a b-tree based algorithm. My
thinking is that if, say, the words it has to find are (hypothetically)
get,
set, and select, then it could be of the form
int getwordtype(word)
{
if((word[0] == 'g') && (word[1] == 'e') && (word[2] == 't'))
return WORDTYPE_GET;
if((word[0] == 's') && (word[1] == 'e'))
{
if((word[2] == 't')) return WORDTYPE_SET;
if((word[3] == 'l') && (word[4] == 'e') && (word[5] == 'c') &&
(word[6] == 't')) return WORDTYPE_SELECT;
}
return WORDTYPE_NONE;
}

would this the fastest method? I do want to implement a b-tree.

Obviously I am not going to type out the whole algorithm with all the
words
in SQL, I am going to write a program in C# to generate the .cpp file
which
will hold it.

Any tips?

Thanks!

Nov 17 '05 #3
Keith MacDonald wrote:
The conventional structure used for symbol table lookup, is a hash
table. You could use std::hash_map straight out of the box.


For a fixed set of keys (e.g. keywords in a language), a TST typically gives
far better performance than a hash table. Hash tables are very commonly
used for dynamic symbol tables, however.

-cd
Nov 17 '05 #4
There is a common practice to use "perfect hash" functions for that.
Such hash function

int symbol(const char *keyword)

takes string and returns some unique number.
There is GNU gperf.exe application for different platforms capable to
generate C/C++ code for some given string table.
http://www.gnu.org/software/gperf/gperf.html

Andrew Fedoniouk.
http://terrainformatica.com


Nov 17 '05 #5
I don't want a unique number, I want a number of my own choosing that I
specify (e.g. 1 for keywords, 2 for functions, etc).

Can it do this?

Nov 17 '05 #6
A TST sounds great Carl, but I am slightly confused about 2 issues:
a) Why it is ternary - implying three child nodes per node?
b) How I would rig one up for a certain set of words, such that a certain
group of words (that I define) would give one result, another group would
give another, etc.?

in order to generate a function where I can feed a word in , and get what
type it is, or -1 if it isn't found?
NOT a unique number for each word, like a hash table?

"Carl Daniel [VC++ MVP]" wrote:
Keith MacDonald wrote:
The conventional structure used for symbol table lookup, is a hash
table. You could use std::hash_map straight out of the box.


For a fixed set of keys (e.g. keywords in a language), a TST typically gives
far better performance than a hash table. Hash tables are very commonly
used for dynamic symbol tables, however.

-cd

Nov 17 '05 #7
Bonj wrote:
A TST sounds great Carl, but I am slightly confused about 2 issues:
a) Why it is ternary - implying three child nodes per node?
Yes.
b) How I would rig one up for a certain set of words, such that a
certain group of words (that I define) would give one result, another
group would give another, etc.?


You build a map using a TST to implement the "key" part of the map. In leaf
nodes of the map, you store your "data".

Did you actually look at the link I posted, or do any reading on TSTs?

-cd
Nov 17 '05 #8
> You build a map using a TST to implement the "key" part of the map. In leaf
nodes of the map, you store your "data".

Did you actually look at the link I posted, or do any reading on TSTs?
Yes, I did try to understand it, even though I neither understand, nor
trust, Java.
Although if you look at figure 2 (click on it to enlarge), if you look at
the topmost node, "t", then go down to "o", then down to "o" again, then
right to "t", it holds the "tot" data, whereas having gone from "t" to "o" to
"o" to "t", I would have expected "toot", not "tot". And I don't understand
how some nodes can be each other's parent - I thought that in a tree an arrow
could only be traversed in one direction.

But the most baffling thing for me is how do I get it into a C++ algorithm,
and how does the data get stored? I have over 1,000 words. What does it
*look* like - a huge 'if' statement, a set of classes, function pointers,
templates, what?

Sorry for my dumbness - I will understand it, it's just that if I can't get
the first principles I can't really progress.


-cd

Nov 17 '05 #9
Bonj wrote:
You build a map using a TST to implement the "key" part of the map.
In leaf nodes of the map, you store your "data".

Did you actually look at the link I posted, or do any reading on
TSTs?


Yes, I did try to understand it, even though I neither understand, nor
trust, Java.


Suggestion: use std::map for starters and replace it with a TST
implementation when you've shown that the performance matters (by profiling)
and after you have a more complete picture of what you need to do.

In the meantime, do some more research:

http://www.cs.princeton.edu/~rs/strings/

There's a TST implementation that's buried inside the Boost Spirit library,
but extracting it is probably a bit much. In any case, if you want to look
at it, it's at boost/boost/spirit/symbols/impl/tst.ipp.

-cd
Nov 17 '05 #10
OK, I think I sort of get what this guy's on about
(http://www.ddj.com/documents/s=921/ddj9804a/)

What would the data at each node be?
In my first attempt I thought each successive level deeper down the tree
would be one successive letter into the word.
But this seems to work on the principle that each node IS a *complete* word,
am I right? Does this mean that there would be less nodes than my first
attempt, but at each node there would be a whole word comparison? I was
hoping not to have to use the str* (_tcs*) functions....

I'll tell you what was wrong with my first attempt. I was trying to get it
so the first letter would be the first level of the tree, the second letter
the second level, etc. so that it only had to make one character comparison
and it'd have eliminated 25 26ths of all the possible words. There's then an
incredibly high chance that it would only have to do one (short-circuiting)
if statement to nail the word.
In addition, all the data was embedded into the if statements of the
algorithm, a strategy I would like to keep if possible. However this seems to
load it dynamically at runtime.... is there any way round this?
The only problem was my algorithm-generating algorithm didn't know how to
treat words that were an entire subset of another word, for example "sp_help"
and "sp_help_index" - it only listed the biggest one. So I tried to split it
up into words of defined length, but the algorithm had been going all night
and the .cpp file had grown to over 2MB, so I thought that was a no-brainer.

Any thoughts?

My main issues are the speed of the actual comparisons themselves (i.e.,
what is the minimum then can consist of), and how the data is stored (would
like it to be in the algorithm, failing that in a resource, failing that,
loaded at runtime).

If the data has to be objects, can you do some sort of 'heap dump' so I can
run the code that generates the objects, then dump them to a resource file,
and deserialize them later?

I also need it case insensitive.

Thanks for your help
Nov 17 '05 #11
OK. (Please also see my other post)
I pretty much get how the tree operates, I just need some specifics... as in
my other post.
when you've shown that the performance matters (by profiling)
I have done: of my list of about 1,100 words, my 'first attempt' (see other
post for details of this) took 0.7 seconds to find all the words once,
against 0.9 for a control algorithm where all the strings were bunched
together in one space-delimited string, it would then add a space to the
front and back and just do one _tcsstr search on that. That works, but it is
clearly slow... and that's including the overhead of VB writing to the
database (the algorithm is in a Win32 DLL called by VB).

So I know I need a tree, and I know how it works. But I just need to know
some specifics now, particularly relating to what each specific comparison is
(see my other post).
I think the examples are too theoretical - I understand the theory. I need
an example of what one of the comparisons actually looks like - given that
I'm using std::basic_string<_TCHAR> for the strings. Does this have a <
operator, and if so, is it fast enough? Is this what I'd want to be using?

Cheers
and after you have a more complete picture of what you need to do.

In the meantime, do some more research:

http://www.cs.princeton.edu/~rs/strings/

There's a TST implementation that's buried inside the Boost Spirit library,
but extracting it is probably a bit much. In any case, if you want to look
at it, it's at boost/boost/spirit/symbols/impl/tst.ipp.

-cd

Nov 17 '05 #12

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

7
by: James | last post by:
I am currently working on a PHP based website that needs to be able to draw from Oracle, MS SQL Server, MySQL and given time and demand other RDBMS. I took a lot of time and care creating a...
2
by: Xenophobe | last post by:
I have a popup window (required by the client) containing a form and would like to prevent users from accessing it directly. They are instead required to access the page via a hyperlink on another...
5
by: Nick Malik | last post by:
reposting to a wider audience "Nick Malik" <nickmalik@hotmail.nospam.com> wrote in message news:WYONc.203854$XM6.119642@attbi_s53... > My turn to ask a question > > I am working on a plug-in...
3
by: OldTrucksCo | last post by:
Hello friends We have this website www.ramonlapayese.com published on the Net. It does have a lot of javascript in it, starting from the random image displaying in its home page, and...
47
by: ship | last post by:
Hi We need some advice: We are thinking of upgrading our Access database from Access 2000 to Access 2004. How stable is MS Office 2003? (particularly Access 2003). We are just a small...
4
by: Nick Malik | last post by:
My turn to ask a question I am working on a plug-in for Sharepoint that will allow a developer to add workflow rules. One of the rules will inform the adapter that it should load a DLL that the...
2
by: WStoreyII | last post by:
This question is for all of the developers out there. I recently had to drop out of school as it was getting way to expensive and i just could not pay it. I am really upset about this becuase i...
5
by: Steve | last post by:
Hi, I am sitting down to design our next set of internal apps. I would like to approach this in a way that would allow me to break logical parts of the application that handle specific tasks...
28
by: Ian Davies | last post by:
Hello I would appreciate some help from someone who has knowledge of working with css, php javascript and how they interact. Ive been working on a task for the last few days and have started to...
1
by: George | last post by:
My desktop application (C# 2.0) works just fine and makes tons of requests via WebRequest and WebClient objects. However, users that are behind Proxy servers are not able to use the app. I'm sure...
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.