Regex for HTML

Hi all.
I have this database table (inherited from a legacy application) that
contains some information that I want to extract.
Basically, in one of the tables, there's a column containing a description
that starts with a NUMBER, but can be preceded by some raw HTML elements.
Examples:
ex1:
<p>12 this is the first item ....
ex2:
<p>12. this is the first item ....
ex3:
<span id="my id" style="width:3" ><p>12. this is the first item ....
ex4:
12. this is the first item ....

I'm trying to extract the number ("12" in all of the examples above).

The closest I got was when I tried the following regular expression pattern:
string pattern = @"(<\w*>)*(?<digit>(\d+)).+";

It didn't put the number in the right match group (= digit). I'm
still new to Regex.

Has anybody come across a similar situation?

Thanks a bunch

TJ !
Nov 13 '05 #1
1 Reply


"TJoker .NET" <no****@nonono.no> wrote in
news:eZ**************@tk2msftngp13.phx.gbl:
Hi all.
I have this database table (inherited from a legacy application) that
contains some information that I want to extract.
Basically, in one of the tables, there's a column containing a
description that starts with a NUMBER, but can be preceded by some
raw HTML elements. Examples:
ex1:
<p>12 this is the first item ....
ex2:
<p>12. this is the first item ....
ex3:
<span id="my id" style="width:3" ><p>12. this is the first item ....
ex4:
12. this is the first item ....

I'm trying to extract the number ("12" in all of the examples above).

The closest I got was when I tried the following regular expression
pattern:
string pattern = @"(<\w*>)*(?<digit>(\d+)).+";

It didn't put the number in the right match group (= digit).
I'm still new to Regex.


Hmm, I'd try the following .NET regular expression:

"(<[^>]+>)*(?<digit>\d+)[^\d]"
That is: zero or more tags, where a tag is a '<' followed by at least
one character that is not a '>', followed by a '>'.

Then comes the number: one or more digits, captured in the "digit" group,
followed by a single non-digit character. That trailing [^\d] is a problem
if the number can be the last thing on the line, because nothing is left
for it to match and the whole pattern fails. It will work if there is
always at least one character after the number.
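
For anyone who wants to try this out, here is a minimal sketch of one way to wrap the idea in a helper. It is a variation on the pattern above, not the exact pattern: it anchors the match at the start of the string with ^ (so a digit inside an attribute such as style="width:3" cannot be picked up by a later match attempt), tolerates whitespace around the tags, and drops the trailing [^\d] so that a number at the very end of the line still matches. The class name LeadingNumber and the method TryExtract are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative and not from the thread.
public static class LeadingNumber
{
    // ^                 start of the string
    // (?:\s*<[^>]+>)*   zero or more tags, each optionally preceded by whitespace
    // \s*               optional whitespace before the number
    // (?<digit>\d+)     the number itself; \d+ is greedy, so it takes every
    //                   consecutive digit and no trailing [^\d] is needed
    private static readonly Regex Pattern =
        new Regex(@"^(?:\s*<[^>]+>)*\s*(?<digit>\d+)");

    // Returns true and sets 'number' to the leading digits if the input
    // starts with optional tags followed by a number; otherwise returns
    // false and sets 'number' to the empty string.
    public static bool TryExtract(string input, out string number)
    {
        Match m = Pattern.Match(input);
        number = m.Success ? m.Groups["digit"].Value : string.Empty;
        return m.Success;
    }
}
```

Calling LeadingNumber.TryExtract on each of the four example strings should yield "12", including ex3 with its attributes. Input that does not begin with optional tags followed by digits is reported as not found, rather than matching some later number in the text.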

Mike
Nov 15 '05 #2
