regular expression question

Hi,

I'm using the regular expression \b\w to find the beginning of a word
in my C# application. If the word is 'public', for example, it works.
However, if the word is '<public', it does not work: it seems that <
is not a valid character, so the beginning of the word starts at
the letter 'p' instead of '<'.

Because I'm not an expert in regular expressions, maybe one of you
can help me? I need the correct regex to find the beginning of
the word '<public' in a string.

Thanks...

Kind regards,
Ludwig
Mar 25 '06 #1
Hi Ludwig,

It is not possible to answer your question as you've stated it. Here's why:
> I'm using the regular expression \b\w to find the beginning of a word
> in my C# application. If the word is 'public', for example, it works.
> However, if the word is '<public', it does not work: it seems that <
> is not a valid character, so the beginning of the word starts at
> the letter 'p' instead of '<'.
You have not defined your terms. You use the word "word," but you have not
defined what that is supposed to mean in your situation. In regular
expressions, there are no words, only characters. The "\w" character class
indicates a word *character*. A word character is defined in regular
expressions as a character that is a letter of the alphabet, a digit, or
an underscore.

So, the character '<' is not defined in regular expressions as a word
character, and therefore is not identified as belonging to the set defined
by your rule.
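
To make this concrete, here is a minimal C# sketch showing where \b\w first matches in the two inputs from the question. The class and method names (WordCharDemo, FirstWordStart) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordCharDemo
{
    // Index of the first position where \b\w matches, or -1 if there is none.
    public static int FirstWordStart(string input)
    {
        Match m = Regex.Match(input, @"\b\w");
        return m.Success ? m.Index : -1;
    }

    static void Main()
    {
        Console.WriteLine(FirstWordStart("public"));   // 0: 'p' is a word character
        Console.WriteLine(FirstWordStart("<public"));  // 1: '<' is skipped, 'p' matches
        Console.WriteLine(Regex.IsMatch("<", @"\w"));  // False: '<' is not in \w
    }
}
```

The pattern behaves as designed here: '<' is simply outside the \w class, so the match starts at the 'p'.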

However, while you have stated that you *do* want to identify the character
'<' as the "beginning of a word," you have not stated exactly what the rule
is, only a small part of it. For example, by what you've told me, the
following character sequences could all be "words" -

Hello Ludwig ('H', 'L') The first letters of each word are identified.

Hello, <Ludwig> ('H', '<') The first letter of "Hello" and the beginning '<'
are identified.

Hello, !!!!!!! ('H', '!') The first letter of "Hello" and the beginning '!'
are identified. This is possible because you have not stated what characters
you do *not* consider to be the beginnings of words.

And so on. In other words, a regular expression is shorthand for a rule that
defines a pattern. You need to explicitly define what the rule is in order
for me to create a regular expression that satisfies that rule.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Show me your certification without works,
and I'll show my certification
*by* my works.
Mar 25 '06 #2
Thanks for the explanation, Kevin!

Well, I'm working on an editor control that supports syntax
highlighting. I have a list of words that should be highlighted when
typed in the editor, for example 'public', 'class', etc.

So at a given time, the user types in the word public, and when the
character 'c' is typed, the word 'public' is colored in blue, for
example.

At the moment, I use the pattern '\b\w' to identify the first
character of the 'word' in the editor, and I use '\w\b' to identify
the last character of a word. This works.

However, there are also xml tags that need to be highlighted; for
example, <sometagname> : if the user types in the '<', it should be
colored; if he then types the last 'e' of 'sometagname', the word
'sometagname' should be colored, if he then types '>', that too should
be colored.

So in fact, each word or character that I define in the list of words,
should be colored.

This list of words can be (for example): public, class, int, long,
byte, byte[], <, >, sometagname, generic<>, etc....

So, if I try to define the rule:
- spaces always define the beginning and end of a word:
public class Test() -> I need to identify the public, class, Test()
- there are characters that are not separated by spaces but that also
have to be found when typed:
<?xml version="1.0" encoding="utf-8" ?> -> I need to identify the <,
?, xml, version, encoding in order to highlight these in various
colors.
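
One possible way to turn a token list like the one above into a single pattern is sketched below. This is only an illustration under stated assumptions: the names TokenListDemo, BuildPattern and Tokens are hypothetical, the token list is a sample, and the pattern does no context checking at all (it would also find "class" inside "subclass" or inside a string literal).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TokenListDemo
{
    // Build one alternation from a token list. Longer tokens come first so
    // that "byte[]" is tried before "byte"; Regex.Escape protects '[', '?', etc.
    public static Regex BuildPattern(IEnumerable<string> tokens)
    {
        var alternatives = tokens
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));
        return new Regex(string.Join("|", alternatives));
    }

    // All tokens found in the input, in order of appearance.
    public static List<string> Tokens(Regex pattern, string input)
    {
        return pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }

    static void Main()
    {
        // Sample list only; a real editor would load its own.
        var pattern = BuildPattern(new[] {
            "public", "class", "byte", "byte[]",
            "<", ">", "?", "xml", "version", "encoding" });

        foreach (string t in Tokens(pattern, "<?xml version=\"1.0\" encoding=\"utf-8\" ?>"))
            Console.WriteLine(t);  // <, ?, xml, version, encoding, ?, > (one per line)
    }
}
```

Sorting by length is the design choice that matters: regex alternation takes the first alternative that matches, so without it "byte" would win over "byte[]".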

I hope that you understand what I'm trying to do here...

Kind regards,
Ludwig
Mar 25 '06 #3
Hi Ludwig,

I can understand that you're trying to implement syntax highlighting in
your application.

What I don't understand is how you can use \b\w to catch a particular
word in your list. \b matches a word boundary, so that's OK, but \w
will match any single word character.

So to catch 'public', if you used "\b\wpublic", it would not match,
since the 'p' of public has already been matched by '\w'. I even tried it
with a positive lookbehind, but it still doesn't work.
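
For comparison, a minimal C# sketch of matching one specific keyword as a whole word, by wrapping it in \b on both sides. The names KeywordMatchDemo and IsWholeWord are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class KeywordMatchDemo
{
    // True if keyword occurs in input with a word boundary on both sides.
    public static bool IsWholeWord(string input, string keyword)
    {
        return Regex.IsMatch(input, @"\b" + Regex.Escape(keyword) + @"\b");
    }

    static void Main()
    {
        Console.WriteLine(IsWholeWord("public class", "public"));  // True
        Console.WriteLine(IsWholeWord("republican", "public"));    // False: no boundary before 'p'
        // \w consumes one character first, so seven characters would be needed.
        Console.WriteLine(Regex.IsMatch("public", @"\b\wpublic")); // False
    }
}
```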

Unless the words in your list are something like : "ublic", "lass",
"yte"

Kindly clarify,

Regards,

Cerebrus.

Mar 26 '06 #4
Hi Ludwig,

We're getting closer, but remember that close only counts in horseshoes and
hand-grenades, not in programming!

To use Regular Expressions, you must be *absolutely specific* about your
rules.
> Well, I'm working on an editor control that supports syntax
> highlighting. I have a list of words that should be highlighted when
> typed in the editor, for example 'public', 'class', etc.
>
> So at a given time, the user types in the word public, and when the
> character 'c' is typed, the word 'public' is colored in blue, for
> example.
Let me explain what is missing here. "Syntax" means nothing to Regular
Expressions, and very little to humans. That is, it can refer to so many
different things (such as the "syntax" I'm using to write this post) that it
identifies nothing in and of itself.
> I have a list of words that should be highlighted when
> typed in the editor
That is what you think you mean, but that is not what you mean. For example,
note the 2 uses of "public" in the following example:

public string Opened()
{
    return "Open to the public";
}

Now, the first instance of "public" is syntax, but the second is part of a
string. In other words, "syntax" is a set of rules. From Dictionary.com,
"syntax" means:

"The rules governing the formation of statements in a programming language."

Obviously, "public" as part of a string is not syntax. How do you expect to
tell the Regular expression the difference? You must know the exact syntax
rules, and be able to express them in Regular Expression syntax.
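
One common way to express "a keyword, but not inside a string" is to let the pattern consume whole string literals first. The C# sketch below illustrates that idea under simplifying assumptions: it only knows double-quoted literals with backslash escapes, it ignores comments, verbatim strings and character literals, and the names KeywordOutsideStringsDemo and KeywordPositions are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KeywordOutsideStringsDemo
{
    // Alternative 1 consumes a whole double-quoted literal (with \" escapes);
    // alternative 2 matches the keyword. Because the literal is consumed as
    // one match, a "public" inside it is never reported as a keyword.
    static readonly Regex Pattern = new Regex(
        @"(?<str>""(?:[^""\\]|\\.)*"")|(?<kw>\bpublic\b)");

    // Start indexes of every "public" that is not inside a string literal.
    public static List<int> KeywordPositions(string code)
    {
        var positions = new List<int>();
        foreach (Match m in Pattern.Matches(code))
            if (m.Groups["kw"].Success)
                positions.Add(m.Index);
        return positions;
    }

    static void Main()
    {
        string code = "public string Opened() { return \"Open to the public\"; }";
        Console.WriteLine(string.Join(",", KeywordPositions(code))); // 0
    }
}
```

Only the first "public" is reported; the one inside the literal is swallowed by the string alternative.
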
> At the moment, I use the pattern '\b\w' to identify the first
Of course, this is unsuitable. The "\b" expression indicates the beginning
or end of a word, that is, a run of characters that is composed entirely
of word characters, and as I said before, '<' is not a word character.
> However, there are also xml tags that need to be highlighted; for
> example, <sometagname> : if the user types in the '<', it should be
> colored; if he then types the last 'e' of 'sometagname', the word
> 'sometagname' should be colored, if he then types '>', that too should
> be colored.
Okay, now you've introduced the topic of XML, which was not part of the
topic in your earlier message, nor up until this point in your current post.
Yet, you have not stated what you mean by "syntax highlighting," nor what
this "syntax" is for. I could assume that you mean "XML syntax" but you have
not said so, so I cannot logically make that assumption. The string you're
parsing may only *contain* XML, as well as other "syntax."
> So in fact, each word or character that I define in the list of words,
> should be colored.
Not necessarily. See my example (about "public") above. You need to be
*absolutely specific*.
> - spaces always define the beginning and end of a word:
Are you certain of this? What about line breaks? Might any of these "words"
be at the beginning or end of the string? If so, they will either not be
preceded by a space or not be followed by one.
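
As an illustration of the edge cases just mentioned, the C# sketch below uses lookarounds, so that the start of the text, the end of the text and line breaks all count as delimiters. It assumes the rule "a token is bounded by whitespace or by the edges of the text", which is only one possible rule; the names DelimitedTokenDemo and IndexOfDelimited are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class DelimitedTokenDemo
{
    // Index of the first occurrence of token that is bounded on each side by
    // whitespace or by the start/end of the text; -1 if there is none.
    public static int IndexOfDelimited(string text, string token)
    {
        // (?<!\S) = "not preceded by a non-space", (?!\S) = "not followed by one".
        string pattern = @"(?<!\S)" + Regex.Escape(token) + @"(?!\S)";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Index : -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOfDelimited("public", "public"));               // 0
        Console.WriteLine(IndexOfDelimited("public class\nTest()", "Test()")); // 13
        // A pattern with literal spaces fails at the edges of the text.
        Console.WriteLine(Regex.IsMatch("public", " public "));                // False
    }
}
```
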
> <?xml version="1.0" encoding="utf-8" ?> -> I need to identify the <,
> ?, xml, version, encoding in order to highlight these in various
> colors.
Okay, see, now you want to identify the '?' in an XML tag. But that is not a
word character, nor is it delimited from "xml" by a space. Again, the syntax
of the Regular Expression depends upon an *absolutely specific* description
of the rules for matching and grouping.
> I hope that you understand what I'm trying to do here...


Not yet, but I hope to!

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

Show me your certification without works,
and I'll show my certification
*by* my works.
Mar 26 '06 #5
Hi Kevin,

thanks again! OK, it seems I need to explain further what I need
:) For my application, I need a .NET textbox control where the user
can type XML, XSLT or HTML. It would also be nice if the control could
do syntax coloring (and IntelliSense, later on), just like Visual
Studio does.

So I did a little test, by inheriting from the RichTextBox control,
overriding OnTextChanged() and implementing something that already did
some syntax coloring with separate words like 'public', 'class' etc.;
but obviously, after your replies I now see that I did not define the
rules specifically enough. The word 'public' in a string is not a keyword,
indeed. However, this test allowed me to find a way to completely avoid
flickering of the control, and now the next step is to do the syntax
coloring, with the rules of XML.

What the user enters into the textbox can be simple xml, like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Kevin</to>
<from>Ludwig</from>
<heading>Thank you</heading>
<body>Thanks for helping me!!</body>
</note>

or XSLT:

<?xml version="1.0" encoding="ISO-8859-1"?><xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"><xsl:template
match="/">
<html>
<body>
<h2>My CD Collection</h2>
<table border="1">
<tr bgcolor="#9acd32">
<th align="left">Title</th>
<th align="left">Artist</th>
</tr>
<xsl:for-each select="catalog/cd">
<tr>
<td><xsl:value-of select="title"/></td>
<td><xsl:value-of select="artist"/></td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template></xsl:stylesheet>

The idea is that when the user types a character, the current caret
position in the textbox is used to analyze the surrounding
words/characters, to see whether a valid XML tag has been formed, like
<body>, or <xsl:for-each select="catalog/cd">. If a valid tag has
formed, the tag name (body, xsl:for-each) should be colored in
a specific color (if it's in a list of valid tag names, but maybe we
can skip this for now). The < and > also have to be colored, in another
color. Attributes get yet another color (version, encoding, select).
Literals and other undefined XML elements like 'My CD Collection' do
not need coloring.

If the user deletes a character so that a tag becomes an invalid tag
(for example, deleting the >), then the coloring of the incomplete tag
has to be removed.
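
The rules described above could be approximated with one pattern per complete tag, as in the C# sketch below. This is a rough illustration and not a real XML parser: it assumes attribute values are always double-quoted, it does not handle processing instructions such as <?xml ... ?>, comments or CDATA, and the names XmlTagDemo, TagNames and AttributeNames are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class XmlTagDemo
{
    // '<', optional '/', a tag name, zero or more name="value" attributes,
    // optional '/', then '>'. A tag without its closing '>' does not match.
    static readonly Regex Tag = new Regex(
        @"</?(?<name>[A-Za-z_][\w:.-]*)" +
        @"(?:\s+(?<attr>[\w:.-]+)\s*=\s*""[^""]*"")*\s*/?>");

    // Names of all complete tags in the text, in order of appearance.
    public static List<string> TagNames(string text)
    {
        var names = new List<string>();
        foreach (Match m in Tag.Matches(text))
            names.Add(m.Groups["name"].Value);
        return names;
    }

    // Attribute names of all complete tags in the text.
    public static List<string> AttributeNames(string text)
    {
        var attrs = new List<string>();
        foreach (Match m in Tag.Matches(text))
            foreach (Capture c in m.Groups["attr"].Captures)
                attrs.Add(c.Value);
        return attrs;
    }

    static void Main()
    {
        string line = "<xsl:for-each select=\"catalog/cd\">";
        Console.WriteLine(string.Join(",", TagNames(line)));        // xsl:for-each
        Console.WriteLine(string.Join(",", AttributeNames(line)));  // select
        Console.WriteLine(TagNames("<body").Count);                 // 0: incomplete tag
    }
}
```

Because an incomplete tag produces no match, removing the coloring after the user deletes the '>' falls out of simply re-running the pattern on the edited text.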

I hope that this time I explained the context better... I didn't realize
that I had to be that specific... so it's all about XML rules.

Mar 27 '06 #6
Hi Cerebrus,

well, imagine the user types in:

public class

When the last 's' is typed, I use the regular expression \b\w in a
right-to-left search to find the beginning of the word class, and then
I use the regular expression \w\b, starting from the position of this
word, to find the end of the word. This way I have found the word
'class', and I can check whether it's a keyword and color it.
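
The two-step search described above might look roughly like the following C# sketch. It is a reconstruction from the description, not the actual editor code: the names WordAtCaretDemo and WordBeforeCaret are hypothetical, and the caret is assumed to be a character index into the text.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordAtCaretDemo
{
    static readonly Regex WordStart = new Regex(@"\b\w", RegexOptions.RightToLeft);
    static readonly Regex WordEnd = new Regex(@"\w\b");

    // Returns the word containing the character just left of the caret, or the
    // nearest word further left if that character is not a word character.
    // Returns "" if no word is found.
    public static string WordBeforeCaret(string text, int caret)
    {
        Match start = WordStart.Match(text, caret);    // scans leftwards from the caret
        if (!start.Success) return "";
        Match end = WordEnd.Match(text, start.Index);  // scans rightwards from the word start
        if (!end.Success) return "";
        return text.Substring(start.Index, end.Index - start.Index + 1);
    }

    static void Main()
    {
        Console.WriteLine(WordBeforeCaret("public class", 12)); // class
        Console.WriteLine(WordBeforeCaret("<public", 7));       // public (the '<' is left out)
    }
}
```

The second call shows the limitation raised at the start of the thread: because '<' is not a word character, it never becomes part of the word that is found.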

This works for 'words', but of course, I did not define the rules precisely
enough, because the word class is not always a keyword, for example when used
in a string.

So this would never work for the xml stuff I'm trying to do (see my
last reply to Kevin in this thread).

Thanks!
Mar 27 '06 #7
