Problem with a Regex

I am having a problem matching some text. It is a very simple pattern
but it doesn't seem to work. Here goes.

<td[^>]*>.*?</td>

That is the pattern, it should match any <td></td> pair. Here is my
input data.

<td valign="top">Buyer<a href="http://www.google.com" >google</a><img
src="www.google.com/s.gif" width="4" border="0">(<a
href="www.google.com">9</a> )<span> </span></td>
<td valign="top">
Buyer

<a href="http://www.google.com" >google</a><img
src="www.google.com/s.gif" width="4" border="0">
(
<a href="www.google.com">9</a> )<span> </span></td>

The first and second are exactly the same but the first has the spaces
removed. The pattern will match the first but not match the second. I
am very confused.

I have run some tests. This pattern will match the first but not the
second.

<td[^>]*>.*?Buyer

This will match both of them.

<td[^>]*>\s*?Buyer

This indicates to me that the '.' is not matching a space character.
Any ideas?

Mar 7 '06 #1
taylorjonl wrote:
I am having a problem matching some text. It is a very simple pattern
but it doesn't seem to work. Here goes.

<td[^>]*>.*?</td>

That is the pattern, it should match any <td></td> pair.


Just out of interest, what are you expecting the '?' to do? Usually it
comes after a different character that you want to match 0 or 1 times -
but in this case you don't have a previous character (the .* is the
previous bit).

I'm far from an expert on regexes, but I don't understand what that '?'
will actually match.
It may be part of the problem.

Jon

Mar 7 '06 #2
The ? after the * tells the regex to be non-greedy. Normally it is
greedy so if we had the input of

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

<td[^>]*>.*</td>

would match

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

Because well, it is greedy and will make the largest match. Adding the
? tells it to be non-greedy.
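
To make the difference concrete, here is a minimal sketch that runs both patterns over the bucket input above. It is written as C# top-level statements (so it assumes C# 9 or later), and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<td>bucket1</td><td>bucket2</td><td>bucket3</td>";

// Greedy: .* grabs as much as it can, so the match runs to the last </td>.
string greedy = Regex.Match(input, @"<td[^>]*>.*</td>").Value;

// Lazy: .*? stops at the first </td> that lets the pattern succeed.
string lazy = Regex.Match(input, @"<td[^>]*>.*?</td>").Value;

// Number of separate cells the lazy pattern finds in the same input.
int lazyCount = Regex.Matches(input, @"<td[^>]*>.*?</td>").Count;

Console.WriteLine(greedy);    // the whole input string
Console.WriteLine(lazy);      // <td>bucket1</td>
Console.WriteLine(lazyCount); // 3
```

With the greedy pattern there is a single match covering all three cells; with the lazy one each cell is its own match.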

Mar 7 '06 #3
Isn't this just due to the dot NOT matching newlines by default (while
\n is included in \s)?

[MSDN about dot:]
"Matches any character except \n. If modified by the Singleline option,
a period character matches any character. For more information, see
Regular Expression Options."
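
A small sketch of that behaviour, assuming a cut-down version of the second sample string (the real one is longer) and C# top-level statements (C# 9 or later). The variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Cut-down version of the second sample: the cell content spans several lines.
string html = "<td valign=\"top\">\nBuyer\n<a href=\"http://www.google.com\">google</a></td>";
string pattern = @"<td[^>]*>.*?</td>";

// Default options: '.' stops at \n, so the pattern cannot reach </td>.
bool matchesByDefault = Regex.IsMatch(html, pattern);

// Singleline: '.' also matches \n, so the whole cell is matched.
bool matchesSingleline = Regex.IsMatch(html, pattern, RegexOptions.Singleline);

// \s includes \n, which is why the \s*?Buyer test worked without any option.
bool whitespaceVariant = Regex.IsMatch(html, @"<td[^>]*>\s*?Buyer");

// The same switch can be set inline with (?s) instead of RegexOptions.
bool inlineFlag = Regex.IsMatch(html, "(?s)" + pattern);

Console.WriteLine(matchesByDefault);  // False
Console.WriteLine(matchesSingleline); // True
Console.WriteLine(whitespaceVariant); // True
Console.WriteLine(inlineFlag);        // True
```

So the dot does match a space character; it is only the newline after the opening tag that stops it.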

Mar 7 '06 #4
Hi taylorjonl and Jon Skeet,

I have a few points to make here :

1. Looking at the sample string you gave and the 1st Regex pattern
(<td[^>]*>.*?</td>), the pattern itself is correct. It is what you
need. The only thing that you need to do now is enable the Regex
option that allows the dot "." to match a newline character. This is
called "Dot matches newline" in other Regex flavours and
RegexOptions.Singleline in .NET.

I don't know which Regex validator you're using to run your tests, but
just try it with that option enabled, and it will definitely work.

2. The ".*?" - This has a special meaning. ".*" alone means "Match any
character any no. of times, as many times as possible (Greedily)" and
".*?" means "Match any character any no. of times, but as few times as
possible (Lazily)".
The difference between a Greedy and a Lazy match is that the former
will match as many occurrences as possible, while the second will match
as few as possible. The latter will give you the shortest match.
I usually use (.*?) to match anything between any other text.
If you simply used the Regex pattern "<td(.*?)</td>" (with the same
option enabled) it would still solve taylorjonl's problem. It just
means match anything that comes between an opening <td and the next
</td> (including spaces, newlines and what not !)

3. I think the important point in deciding any Regex Pattern is what
you want to retrieve from it. (what will be stored in the back
reference). For instance, in your sample string, what exactly do you
intend to retrieve ? Whatever it is should be in brackets.

Assuming it's the "Buyer" part, use this Regex pattern (remember to set
the RegexOptions.Singleline option)

<td[^>]*>(.*?)<a.*?</td>

Try a replace action with the replacement pattern "$1" (.NET notation),
and you will have found some Buyers !!! ;-)
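
As a rough sketch of point 3, the following applies that pattern to a shortened version of the multi-line sample. The shortened string, the variable names, and the Trim() calls are assumptions added for illustration; it is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened version of the multi-line sample cell.
string html = "<td valign=\"top\">\nBuyer\n\n<a href=\"http://www.google.com\">google</a>"
            + "(<a href=\"www.google.com\">9</a>)<span> </span></td>";

// Singleline lets '.' cross the newlines inside the cell.
Regex cell = new Regex(@"<td[^>]*>(.*?)<a.*?</td>", RegexOptions.Singleline);

// Groups[1] is the text the replacement string "$1" refers to.
Match m = cell.Match(html);
string captured = m.Groups[1].Value.Trim();

// Replace the entire cell with just the captured text.
string replaced = cell.Replace(html, "$1").Trim();

Console.WriteLine(captured); // Buyer
Console.WriteLine(replaced); // Buyer
```

The capture includes the surrounding newlines, which is why the sketch trims the result.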

Hope this helps,

Regards,

Cerebrus.

Mar 7 '06 #5
You are the bomb, this has been driving me nuts trying to figure it out
and I knew it had to be something simple. If only .NET would behave
like the rest of the world when it comes to regular expressions.

Thanks again, works like a charm now.

Mar 7 '06 #6
Well, you know... .NET is... kinda Exceptional !!! ;-)

BTW, what part of that sample string did you want to retrieve ?

Regards,

Cerebrus.

Mar 7 '06 #7
And you're most welcome...

Regards,

Cerebrus.

Mar 7 '06 #8
That string is just a test string I used. I am actually going to be
extracting certain pieces of information from an eBay feedback page.
Since the last post I have come up with the following to do this so far.

using System.Text.RegularExpressions;

// Matches one feedback row; named groups: type, message, date, item.
Regex regex = new Regex(
    @"<tr[^>]*>[^<]*<td></td>[^<]*<td[^>]*>.*?alt=""(?<type>[^""]+)""></td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>(?<message>[^<]*)<br></td>"
    + @"[^<]*<td></td>[^<]*<td[^>]*>.*?</td>[^<]*<td></td>"
    + @"[^<]*<td[^>]*>(?<date>.*?) </td>[^<]*<td></td>[^<]*"
    + @"<td[^>]*>[^>]*>(?<item>\d{10})</a></td>[^<]*<td></td>[^<]*</tr>",
    RegexOptions.IgnoreCase
    | RegexOptions.Multiline
    | RegexOptions.Singleline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

That will get me all the important sections that I can reference by
name.
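
For readers unfamiliar with named groups, here is a minimal sketch of reading them back by name. The row and the pattern are hypothetical and far simpler than a real feedback page; only the group names type, message and item are borrowed from the pattern above. It is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical row, much simpler than a real feedback page.
string row = "<tr><td>Praise</td><td>Fast shipping</td><td>1234567890</td></tr>";

// Simplified pattern using the same (?<name>...) syntax.
Regex rowPattern = new Regex(
    @"<tr>\s*<td>(?<type>[^<]*)</td>\s*<td>(?<message>[^<]*)</td>\s*<td>(?<item>\d{10})</td>\s*</tr>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

// Each named group is read through the Groups indexer by its name.
Match m = rowPattern.Match(row);
string type = m.Groups["type"].Value;
string message = m.Groups["message"].Value;
string item = m.Groups["item"].Value;

Console.WriteLine(type);    // Praise
Console.WriteLine(message); // Fast shipping
Console.WriteLine(item);    // 1234567890
```

To process a page with many rows, the same lookups can be done on each Match returned by Regex.Matches.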

I am using a program called Expresso which is wonderful for testing
these out.

Thanks for the help.

Mar 7 '06 #9
taylorjonl wrote:
The ? after the * tells the regex to be non-greedy. Normally it is
greedy so if we had the input of

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

<td[^>]*>.*</td>

would match

<td>bucket1</td><td>bucket2</td><td>bucket3</td>

Because well, it is greedy and will make the largest match. Adding the
? tells it to be non-greedy.


Aha - great, thanks for that. There's always more to know about
regexes...

Jon

Mar 7 '06 #10
