I'm going to answer your question in a roundabout way.
There are a couple of steps to building a compiler. The first is lexical
analysis: deciding which characters belong together and turning the source
text into a string of tokens. For example, in the expression A = b * 2; the
character '=' is a single token, as is the character '*', but in the
expression A *= b; the same two characters form the single two-character
token '*='. This requires a set of interpretation rules that must be
applied in a particular order.
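To make the '*' versus '*=' point concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea carries over to C#). The operator table and the function name tokenize are my own assumptions for illustration, not part of any real tool. The "particular order" shows up as trying the longer operator before the shorter ones.

```python
# Hypothetical operator table: longer operators are listed first so that
# "*=" is tried before "*" and "=" (the "longest match" rule).
OPERATORS = ["*=", "=", "*", ";"]

def tokenize(source):
    """Split source text into identifier, number, and operator tokens."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1                      # whitespace separates tokens
        elif ch.isalpha() or ch == "_":
            start = i                   # identifier: letters, digits, '_'
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(source[start:i])
        elif ch.isdigit():
            start = i                   # integer literal
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(source[start:i])
        else:
            for op in OPERATORS:        # first (longest) operator that fits
                if source.startswith(op, i):
                    tokens.append(op)
                    i += len(op)
                    break
            else:
                raise ValueError(f"unexpected character {ch!r} at {i}")
    return tokens

print(tokenize("A = b * 2;"))
print(tokenize("A *= b;"))
```

If OPERATORS listed "*" ahead of "*=", the second call would split the compound operator into two tokens, which is exactly the ordering problem described above.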
The next step is parsing. This means interpreting the code into an ordered
series of expressions and structures. Parsing yields a semantic tree (you
will also see it called a parse tree or syntax tree): a memory structure
that represents the code "as it is".
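As a rough illustration of what such a tree looks like, the sketch below parses an already-tokenized assignment statement into nested tuples. It is in Python only for brevity, the tiny grammar (a name, '=' or '*=', operands joined by '*', then ';') is an assumption made for this example, and parse_statement is a hypothetical name.

```python
def parse_statement(tokens):
    """Parse: name ('=' | '*=') operand ('*' operand)* ';' into a tuple tree."""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if not (tok.isidentifier() or tok.isdigit()):
            raise SyntaxError(f"expected a name or number, got {tok!r}")
        pos += 1
        return tok

    def expr():
        nonlocal pos
        node = operand()
        # Each further '*' wraps what we have so far in a new tree node.
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ("*", node, operand())
        return node

    target = operand()
    if not target.isidentifier():
        raise SyntaxError("the left side of an assignment must be a name")
    if pos >= len(tokens) or tokens[pos] not in ("=", "*="):
        raise SyntaxError("expected an assignment operator")
    op = tokens[pos]
    pos += 1
    value = expr()
    if pos >= len(tokens) or tokens[pos] != ";":
        raise SyntaxError("expected ';'")
    return (op, target, value)

print(parse_statement(["A", "=", "b", "*", "2", ";"]))
```

The result for the statement above is a node for the assignment whose right-hand child is itself a node for the multiplication, which is the "ordered series of expressions and structures" in miniature.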
In compiler development, this semantic tree is traversed by the code
generator to generate the initial object code. That object code is then
processed in repeated passes to: optimize, link, reduce, and collect
together other resources and necessary elements (like initialized memory
header blocks and register settings).
So, when you are asking about syntax highlighting... why did I go into
compiler theory? Because the first two steps are nearly identical.
To do syntax highlighting, you have to perform the lexical analysis and the
parsing to create a semantic tree. However, your semantic tree has to be a
little more forgiving than a typical compiler would allow, because if you
are doing this to create an add-in to a text editor, then the code is being
dynamically written, so things like variables without a declaration, and
uncompleted quoted strings cannot cause your parsing to wander off the
mathematical deep end. Also, in syntax highlighting, you care about
retaining comments, whereas in a compiler, comments are discarded immediately.
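Here is a small sketch of what "forgiving" can mean at the lexical level, again in Python for brevity. It splits the text into comment, string, and code spans, keeps the comments, and treats an unfinished string or comment as simply running to the end of the line or the end of the text instead of failing. The function name scan and the three span kinds are assumptions made for this example.

```python
def scan(source):
    """Return (kind, text) spans that together cover every character."""
    spans = []
    i, n = 0, len(source)
    while i < n:
        start = i
        if source.startswith("/*", i):
            end = source.find("*/", i + 2)
            # An unterminated comment just runs to the end of the text.
            i = n if end == -1 else end + 2
            kind = "comment"
        elif source.startswith("//", i):
            end = source.find("\n", i)
            i = n if end == -1 else end
            kind = "comment"
        elif source[i] == '"':
            i += 1
            # Stop at the closing quote, or give up at the end of the line.
            while i < n and source[i] not in '"\n':
                i += 2 if source[i] == "\\" else 1   # skip escaped characters
            i = min(i, n)
            if i < n and source[i] == '"':
                i += 1
            kind = "string"
        else:
            # Plain code: everything up to the next quote or comment opener.
            while (i < n and source[i] != '"'
                   and not source.startswith(("/*", "//"), i)):
                i += 1
            kind = "code"
        spans.append((kind, source[start:i]))
    return spans

for kind, text in scan('x = 1; /* a " quote */ s = "abc\ny'):
    print(kind, repr(text))
```

Because comments are recognized before strings at each position, the double quote inside the /* */ comment never starts a string, and the unfinished "abc is closed off at the end of its line so the rest of the text still gets scanned.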
Also, your syntax highlighter will probably need to be able to detect
methods from Framework objects. For example, the expression
return sbStuff.ToString();
involves a keyword (return), a variable (sbStuff) of a particular type, and
a method on that type (ToString) which can take many forms, one of which is
the form shown (no parameters). This will require the ability to reflect
through the .NET framework in an efficient manner, something that is beyond
my experience to help you with.
In order to do this, you will have to get your parsing semantic tree and
"decorate" it with indicators that illustrate the "classification" of the
token... in other words, do you believe the token to be: a reserved word, a
constant, a method or property call, an operator (like the '.' above), a
line terminator (the ';'), etc.
Then with your decorated semantic tree, you can examine your code segment in
your highlighting area and determine what color or highlight to apply to
each object, based upon the decoration applied to the object in your
semantic tree.
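The decorate-then-color idea can be sketched as follows, in Python for brevity. To keep it short it works on a flat token list instead of a full tree, and it guesses "member" purely from a preceding '.' with no reflection at all. The keyword subset, the classification names, and the colors are all assumptions made for this example.

```python
# Assumed subsets for illustration; a real highlighter needs the full lists.
KEYWORDS = {"return", "if", "else", "while", "for", "class"}
OPERATORS = {".", "=", "*", "*=", "(", ")"}

# Hypothetical color choices, one per classification.
COLORS = {
    "keyword": "blue",
    "operator": "gray",
    "terminator": "gray",
    "constant": "red",
    "member": "teal",
    "identifier": "black",
}

def classify(token, previous=None):
    """Guess what a token is from its text and the token before it."""
    if token in KEYWORDS:
        return "keyword"
    if token == ";":
        return "terminator"
    if token in OPERATORS:
        return "operator"
    if token[0].isdigit():
        return "constant"
    if previous == ".":
        return "member"          # a name right after '.' is a method/property
    return "identifier"

def decorate(tokens):
    """Attach a classification and a color to every token."""
    decorated = []
    previous = None
    for tok in tokens:
        kind = classify(tok, previous)
        decorated.append((tok, kind, COLORS[kind]))
        previous = tok
    return decorated

for entry in decorate(["return", "sbStuff", ".", "ToString", "(", ")", ";"]):
    print(entry)
```

Keeping the classification step separate from the color table means the colors can be changed, or made user-configurable, without touching the analysis.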
Now you know why no one wanted to answer you.
This is a very brief description of one of the more difficult college
courses I had: compiler theory and the implementation of Finite State
Automata. I loved it (excellent professors... good school... go Vols!).
So if you want to learn how to do what you are doing, you will need to
become pretty good at lexical analysis (not the hardest topic, but something
that does require a good bit of math), and the needed data structures to
create a working parser. A result of this nature would have been way beyond
the time-frames that a college course would typically allow. In other
words, while most of the folks who pass this course could probably create
the lex and parse steps, there's no way that there would have been time, in
a three-and-a-half month semester, for the students to learn the material,
complete the assignment, and have the professor grade 20 submissions!
There are some tools that can help. A long time ago, the researchers at
Bell Labs put out two nice utilities: lex and yacc (a lexical analyzer
generator, and 'yet another compiler compiler'). From their inspiration on
Unix, hundreds of utilities have been written over the years to do similar
things. The input to tools like this is your syntax, written in a notation
called BNF (Backus-Naur Form). This is a method for expressing the rules
that drive both the lexical analysis and parsing. You may be able to take a
"syntax-highlighting" text editor, which does all this for you, and simply
supply the BNF for C# and a component for reflecting on the framework...
that would be nice. Take a look at SourceForge or GotDotNet for some ideas.
If you can find one of these utilities, that would be a good starting point
for developing your highlighting parser. There may be online courses and
tutorials on lexical analysis and language parsing... I do not know. You
will probably need a tutorial on BNF as well.
Of course, you have to decide, right now, if you want to go this deep. If
you do, many folks here will encourage you (myself included).
Good Luck,
--- Nick Malik
Solutions Architect
P.S. Regular expressions are NOT going to do this for you. Set that notion
aside. You can use Regex for some simple lexical analysis, that's it.
Parsing cannot be reasonably done (and debugged) with regex.
"Bob hotmail.com>" <goodoldave@<spamkill> wrote in message
news:9E**********************************@microsoft.com...
Everyone,
I have been spending weeks looking on the web for a good tutorial on how
to use regular expressions and other methods to satisfy my craving for
learning how to do FAST c-style syntax highlighting in C# but I have yet to
find anything useful.
I know there are people at MS that know this stuff like the back of their
hand and I know there are many people out on the web that are proficient in
doing this as well but it seems nobody would like to teach this knowledge.
I understand that this is not an easy subject but does anybody know of any
web tutorials (free or requires payment), books, or anybody who has written a complete tutorial that demonstrates,
from start to finish, how to implement c-style syntax highlighting? I am
talking about multiline comments, strings, keywords.. all of the fun stuff
including not recognizing a single double-quote in a multi-line /**/
comment as the start of a string, etc...
Any help would be greatly appreciated!