C-style Syntax Highlighting Tutorial

Everyone,

I have spent weeks searching the web for a good tutorial on how to use regular expressions and other methods to do FAST C-style syntax highlighting in C#, but I have yet to find anything useful.

I know there are people at MS who know this stuff like the back of their hand, and I know there are many people on the web who are proficient in it as well, but it seems nobody wants to teach this knowledge.

I understand that this is not an easy subject, but does anybody know of any web tutorials (free or paid), books, or anybody who has written a complete tutorial that demonstrates, from start to finish, how to implement C-style syntax highlighting? I am talking about multiline comments, strings, keywords... all of the fun stuff, including not recognizing a single double-quote in a multi-line /**/ comment as the start of a string, etc.

Any help would be greatly appreciated!

Nov 15 '05 #1
I'm going to answer your question in a round-about way.

There are a couple of steps to building a compiler. The first is lexical
analysis: figuring out which characters belong together as a token and
producing a stream of tokens. For example, in the expression A = b * 2; the
character '=' is a single token, as is the character '*', but in the
expression A *= b; the same two characters form the single two-character
token '*='. This requires a set of interpretation rules that must be applied
in a particular order.
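
To make the '=' / '*' / '*=' point concrete, here is a minimal sketch of a longest-match tokenizer. It is written in Python rather than C# purely for brevity, and it only knows the handful of operators used in the example above. The names tokenize and OPERATORS are made up for this illustration and do not come from any library.

```python
# Operators are listed longest first, so "*=" is tried before "*" and "=".
# This ordering is the "particular order" the rules have to be applied in.
OPERATORS = ["*=", "=", "*", ";"]


def tokenize(source):
    """Split source into (kind, text) pairs."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("identifier", source[start:i]))
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("number", source[start:i]))
        else:
            for op in OPERATORS:
                if source.startswith(op, i):
                    tokens.append(("operator", op))
                    i += len(op)
                    break
            else:
                # Not a known operator: keep the character instead of failing.
                tokens.append(("unknown", ch))
                i += 1
    return tokens
```

With this sketch, tokenize("A *= b;") produces one operator token '*=', while tokenize("A = b * 2;") produces '=' and '*' as separate tokens.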

The next step is parsing. This means interpreting the tokens into an ordered
series of expressions and structures. Parsing yields a syntax tree (I'll call
it a semantic tree here): a memory structure that represents the code "as it
is".

In compiler development, this semantic tree is traversed by the code
generator to generate the initial object code. That object code is then
processed in repeated passes to: optimize, link, reduce, and collect
together other resources and necessary elements (like initialized memory
header blocks and register settings).

So, when you are asking about syntax highlighting... why did I go into
compiler theory? Because the first two steps are nearly identical.

To do syntax highlighting, you have to perform the lexical analysis and the
parsing to create a semantic tree. However, your semantic tree has to be a
little more forgiving than a typical compiler would allow, because if you
are doing this to create an add-in to a text editor, then the code is being
dynamically written, so things like variables without a declaration, and
uncompleted quoted strings cannot cause your parsing to wander off the
mathematical deep end. Also, in syntax highlighting, you care about
retaining comments, whereas a compiler discards them immediately.
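
The "forgiving" part can be sketched as follows. This is an assumption-laden illustration in Python, not how any particular editor does it: the function name scan is hypothetical, an unclosed /* comment is taken to run to the end of the text, and an unclosed string is taken to end at the end of its line. Comments are kept as spans rather than thrown away, and a double-quote inside a comment never starts a string because the comment is consumed first.

```python
def scan(source):
    """Return (kind, start, end) spans that cover every character of source."""
    spans = []
    i, n = 0, len(source)
    while i < n:
        start = i
        if source.startswith("/*", i):
            close = source.find("*/", i + 2)
            # An unclosed comment simply runs to the end of the text.
            i = n if close == -1 else close + 2
            kind = "comment"
        elif source[i] == '"':
            i += 1
            # Stop at the closing quote, or at the end of the line if it is missing.
            while i < n and source[i] not in '"\n':
                # Skip an escaped character, but never past a line break.
                if source[i] == "\\" and i + 1 < n and source[i + 1] != "\n":
                    i += 1
                i += 1
            if i < n and source[i] == '"':
                i += 1
            kind = "string"
        else:
            # Plain text runs until the next comment or string opener.
            while i < n and source[i] != '"' and not source.startswith("/*", i):
                i += 1
            kind = "text"
        spans.append((kind, start, i))
    return spans
```

Because the spans cover the whole input, half-typed code still gets a result; the worst case is that one span is longer than the user eventually intends.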

Also, your investigation into syntax highlighting will probably need to be
able to detect methods from Framework objects. For example, it would need to
know that the expression
return sbStuff.ToString();
involves a keyword (return), a variable (sbStuff) of a particular type, and
a method on that type (ToString) which can take many forms, one of which is
the form shown (no parameters). This will require the ability to reflect
through the .NET Framework in an efficient manner, something that is beyond
my experience to help you with.

In order to do this, you will have to get your parsing semantic tree and
"decorate" it with indicators that illustrate the "classification" of the
token... in other words, do you believe the token to be: a reserved word, a
constant, a method or property call, an operator (like the '.' above), a
line terminator (the ';'), etc.

Then with your decorated semantic tree, you can examine your code segment in
your highlighting area and determine what color or highlight to apply to
each object, based upon the decoration applied to the object in your
semantic tree.
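
As a rough illustration of "decorating" tokens and then picking a color from the decoration, here is a small Python sketch. Everything in it is an assumption made for the example: the keyword subset, the color names, and the shortcut of calling an identifier a "call" when the next token is "(". A real implementation would consult type information (for instance through reflection) instead of that shortcut, and would work on a tree rather than a flat token list.

```python
KEYWORDS = {"return", "if", "else", "while"}  # illustrative subset only
PUNCTUATION = {".", "(", ")"}

# Hypothetical palette: one color per classification.
COLORS = {
    "keyword": "blue",
    "identifier": "black",
    "call": "teal",
    "operator": "gray",
    "terminator": "gray",
}


def decorate(tokens):
    """Attach a classification and a color to each token text."""
    decorated = []
    for index, text in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else ""
        if text in KEYWORDS:
            kind = "keyword"
        elif text == ";":
            kind = "terminator"
        elif text in PUNCTUATION:
            kind = "operator"
        elif following == "(":
            # Heuristic stand-in for real type lookup.
            kind = "call"
        else:
            kind = "identifier"
        decorated.append((text, kind, COLORS[kind]))
    return decorated
```

For the tokens of return sbStuff.ToString(); this classifies return as a keyword, sbStuff as an identifier, ToString as a call, and the trailing ; as a terminator.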

Now you know why no one wanted to answer you.

This is a very brief description of one of the more difficult college
courses I had: compiler theory and the implementation of Finite State
Automata. I loved it (excellent professors... good school... go Vols!).

So if you want to learn how to do what you are doing, you will need to
become pretty good at lexical analysis (not the hardest topic, but something
that does require a good bit of math), and at the data structures needed to
create a working parser. A result of this nature would have been way beyond
what a college course would typically expect. In other words, while most of
the folks who pass this course could probably create the lex and parse
steps, there's no way that there would have been time, in a
three-and-a-half month semester, for the students to learn the material,
complete the assignment, and have the professor grade 20 submissions!

There are some tools that can help. A long time ago, the researchers at
Bell Labs put out two nice utilities: lex and yacc (a lexical analyzer
generator, and 'yet another compiler compiler'). From their inspiration on
Unix, hundreds of utilities have been written over the years to do similar
things. The input to tools like this is your syntax, written in a notation
called BNF (Backus-Naur Form). This is a method for expressing the grammar
rules that drive the parsing (lex takes its token rules as regular
expressions). You may be able to take a "syntax-highlighting" text editor,
which does all this for you, and simply supply the BNF for C# and a
component for reflecting on the framework... that would be nice. Take a look
at SourceForge or GotDotNet for some ideas.

If you can find one of these utilities, that would be a good starting point
for developing your highlighting parser. There may be online courses and
tutorials on lexical analysis and language parsing... I do not know. You
will probably need a tutorial on BNF as well.
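
For a feel of what BNF looks like and how it relates to a parser, here is a toy grammar for the A = b * 2; example, written as comments, followed by a hand-written recursive-descent parser in Python that follows it rule for rule. The grammar and the name parse_assignment are invented for this sketch; this is nowhere near the C# grammar, and tools like yacc generate a different (table-driven) kind of parser from such rules.

```python
# Toy grammar in BNF (an assumption for illustration, not the C# grammar):
#
#   <assignment> ::= <identifier> "=" <expression> ";"
#   <expression> ::= <term> | <term> "*" <expression>
#   <term>       ::= <identifier> | <number>


def parse_assignment(tokens):
    """Parse a list of token strings into a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(is_valid, description):
        # Consume one token if it satisfies is_valid, otherwise report an error.
        nonlocal pos
        token = peek()
        if token is None or not is_valid(token):
            raise SyntaxError(f"expected {description}, found {token!r}")
        pos += 1
        return token

    def expression():
        left = take(lambda t: t.isidentifier() or t.isdigit(), "a term")
        if peek() == "*":
            take(lambda t: t == "*", "'*'")
            return ("*", left, expression())
        return left

    target = take(str.isidentifier, "an identifier")
    take(lambda t: t == "=", "'='")
    value = expression()
    take(lambda t: t == ";", "';'")
    return ("=", target, value)
```

Calling parse_assignment(["A", "=", "b", "*", "2", ";"]) returns the tree ("=", "A", ("*", "b", "2")). Note that this version raises SyntaxError on incomplete input, which is exactly the compiler-style strictness a highlighter would have to relax.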

Of course, you have to decide, right now, if you want to go this deep. If
you do, many folks here will encourage you (myself included).

Good Luck,
--- Nick Malik
Solutions Architect

P.S. Regular expressions are NOT going to do this for you. Set that notion
aside. You can use Regex for some simple lexical analysis, that's it.
Parsing cannot be reasonably done (and debugged) with regex.
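
As an example of the "simple lexical analysis" that a regex can handle, here is a hedged Python sketch for a tiny language with only block comments, double-quoted strings, and keywords. It uses Python's re module, not .NET's Regex class, so syntax details may differ. The keyword set and the name highlight_spans are assumptions made for the illustration. The pattern is compiled once and scans the text in a single pass.

```python
import re

KEYWORDS = {"select", "from", "where"}  # illustrative subset only

# One alternation, tried left to right at each position. A comment is matched
# whole, so a double-quote inside it is never seen as the start of a string.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>/\*.*?(?:\*/|\Z))              # block comment, closed or not
  | (?P<string>"(?:\\[^\n]|[^"\\\n])*"?)       # string, closing quote optional
  | (?P<word>[A-Za-z_]\w*)                     # identifier or keyword
    """,
    re.VERBOSE | re.DOTALL,
)


def highlight_spans(source):
    """Return (kind, start, end) for every comment, string, and keyword."""
    spans = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "word":
            if match.group().lower() not in KEYWORDS:
                continue  # ordinary identifier: leave it uncolored
            kind = "keyword"
        spans.append((kind, match.start(), match.end()))
    return spans
```

This only finds spans to color. It does no parsing, which is in line with the point above.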
"Bob hotmail.com>" <goodoldave@<spamkill> wrote in message
news:9E**********************************@microsof t.com...
Everyone,

I have been spending weeks looking on the web for a good tutorial on how to use regular expressions and other methods to satisfy my craving for
learning how to do FAST c-style syntax highlighting in C# but I have yet to
find anything useful.
I know there are people at MS that know this stuff like the front of their hand and I know there are many people out on the web that are proficient in
doing this as well but it seems nobody would like to teach this knowledge.
I understand that this is not an easy subject but does anybody know of any web tutorials (free or requires payment), books, or anybody who has written a complete tutorial that demonstrates, from start to finish, how to implement c-style syntax highlighting? I am
talking about multiline comments, strings, keywords.. all of the fun stuff
including not recognizing a single double-quote in a multi-line /**/
comment as the start of a string, etc...
Any help would be greatly appreciated!

Nov 15 '05 #2
Bob
Nick,

Thank you very much for the answer. However, I think I was a little broad in what I was asking for. I do, eventually, want to get into the nuts and bolts of compilers and lexical analysis, but I need to start a little lighter. Let me explain where I am and why I asked the question.

I am simply trying to highlight a simplistic language which only has keywords, multiline comments and strings (something like TSQL). I have already created a working syntax highlighter, but it is extremely slow: it currently takes about 17 seconds to parse about 35 printed pages of code. I have created the lexical analyzer (though it is clunky) and it produces a string of tokens. I am under the impression that regular expressions can make this process much faster but, as you stated, I am probably wrong. I asked about doing the highlighting with regular expressions because, in my understanding, the main purpose of regular expressions in programming is to scan large amounts of text and do matching, replacing, etc.

How would you suggest I go about learning to make the syntax highlighting of something as simplistic as what I described "fast as lightning"?
Nov 15 '05 #3
Hello,

There are a couple of open source IDEs which offer the things you're looking
for. Maybe you should check them out to see how they do it.

For C# : #develop (http://www.icsharpcode.net/OpenSource/SD/Default.aspx)
For C/C++ : CodeMax (somewhere on the yahoo groups there is an open source
version)

I would think that a lot of the IDEs don't keep track of everything, but
only of what's visible, and just color whatever is inside the client area. So
they probably don't color 35 pages at once.
One other thing is that the Regex implementation in .NET is very slow
compared to other languages.
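
The "only color what's visible" idea could look roughly like the Python sketch below. This is a guess at the general approach, not a description of how #develop or CodeMax actually work, and all names in it are hypothetical. The one piece of state a visible line needs from the text above it is whether it starts inside a /* ... */ comment, so that flag is recorded per line; escaped quotes inside strings are ignored to keep the sketch short.

```python
def ends_in_comment(line, in_comment):
    """Return True if a block comment is still open at the end of line."""
    i = 0
    while i < len(line):
        if in_comment:
            close = line.find("*/", i)
            if close == -1:
                return True
            in_comment = False
            i = close + 2
        elif line.startswith("/*", i):
            in_comment = True
            i += 2
        elif line[i] == '"':
            close = line.find('"', i + 1)
            # An unterminated string is assumed to end with the line.
            i = len(line) if close == -1 else close + 1
        else:
            i += 1
    return in_comment


def line_start_states(lines):
    """For each line, record whether it begins inside a block comment."""
    states = []
    in_comment = False
    for line in lines:
        states.append(in_comment)
        in_comment = ends_in_comment(line, in_comment)
    return states


def visible_lines(lines, states, first_visible, visible_count):
    """Return (line_number, starts_in_comment, text) for the visible lines."""
    last = min(first_visible + visible_count, len(lines))
    return [(n, states[n], lines[n]) for n in range(first_visible, last)]
```

Only the lines returned by visible_lines would then be tokenized and colored. The state pass still touches every line, but it is cheap, and after an edit it only has to be redone from the changed line down to the first line whose recorded state does not change.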

Yves
Nov 15 '05 #4
Bob
Thanks for the links and the .NET RegEx info. I will check it out and see what I can muster out of all this. (only 10MB for the source) :)
Nov 15 '05 #5
