
regex pattern - ignore whitespace (CRLF and spaces)?

I have an HTML fragment that looks like this:

<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>

I am trying to extract the '' part of it.

This pattern works:

Property \s\ *ID: </span></td>\s\ *<td .*><b>&nbsp;(.*)</b>

I would like to simplify the pattern so that it ignores newline
characters, &nbsp; entities, and runs of more than one space. Ideally,
the HTML would look like:

<tr><td valign="top" nowrap><span class="textBold">Property
ID:</span></td><td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>01-068-24-64-1024</b></td></tr>

Does anyone have a suggestion on this?

Thanks a lot,

Craig Buchanan
Mar 27 '06 #1
2 Replies



Craig Buchanan wrote:
I have an HTML fragment that looks like this: [snip] Does anyone have a suggestion on this?


My suggestion is to use HTMLAgilityPack, not Regex, for parsing HTML.
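
To illustrate the parser-based approach, here is a minimal sketch. It does not use HTMLAgilityPack (a .NET library); it assumes Python and its standard-library html.parser as a stand-in, and the names CellCollector and property_id are made up for this example. It collects the text of each <td> cell, collapses whitespace and &nbsp;, and returns the cell that follows the "Property ID:" label.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell with whitespace collapsed.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into the character U+00A0.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._current = None  # text pieces gathered while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            # str.split() with no argument splits on any Unicode whitespace,
            # including U+00A0, so newlines, &nbsp; and repeated spaces
            # all collapse to single spaces.
            self.cells.append(" ".join("".join(self._current).split()))
            self._current = None


def property_id(html):
    """Return the text of the cell after the 'Property ID:' cell, or None."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    for label, value in zip(parser.cells, parser.cells[1:]):
        if label == "Property ID:":
            return value
    return None


SAMPLE = """<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>"""

print(property_id(SAMPLE))
```

Because the parser works on cells rather than on the raw character stream, line breaks inside tags and stray &nbsp; entities do not need to be anticipated in a pattern.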

--
Larry Lard
Replies to group please

Mar 27 '06 #2

Hi Craig,
I am trying to extract the '' part of it.

Part of your question remains unclear to me, so I can only work with
assumptions:

1. Which HTML sample do you want to match? The first, the second, or
both? I will assume both.
2. What part do you want to extract? I think you left that part out.
From your Regex, you apparently want to match the part within the
<b>...</b> tags. That is what I will assume.

Here is a Regex to match (turn on "Dot matches newline" mode for it
to work):

Property\s+ID:\s*</span></td>\s*<td.*?><b>(?:&nbsp;)*(.*?)</b></td>

Points to note:
----------------------
1. Instead of \s\ * , I have used \s+ where there will be at least one
whitespace character, and \s* where there might be zero or more. This
can be changed to \s* in all cases. \s matches spaces, tabs and line
breaks.
2. If a line breaks at a position the pattern does not anticipate, the
Regex will not match.
3. To match zero or more &nbsp; entities, I have used (?:&nbsp;)*. The
group is non-capturing, so the entity is not stored in a backreference.
The capture that follows is the lazy (.*?) so that, with the dot
matching newlines, it stops at the first </b></td>.
4. If you're using .NET, you can turn on "Dot matches newline" mode
with the RegexOptions.Singleline option.
5. Regexes can only match very specific strings. Usually you can relax
them a bit for spaces and line breaks, but not for other characters. For
instance, if an &nbsp; is inserted anywhere in the string other than
just inside the <b>...</b> tags, the Regex will not match. So, if you're
expecting very diverse HTML fragments, you would be better off with
Larry's suggestion of using HTMLAgilityPack. It can be downloaded from:

http://www.codefluent.com/smourier/d...gilitypack.zip
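
For anyone who wants to try the pattern outside .NET, here is a small sketch in Python, where re.DOTALL plays the role of "Dot matches newline" (RegexOptions.Singleline in .NET). It uses a variant of the pattern that treats &nbsp; as a literal group rather than a character class and captures lazily. The function name extract_property_id is made up for this example, and the two samples are the fragments from the original question.

```python
import re

# re.DOTALL lets "." match line breaks, like RegexOptions.Singleline in .NET.
PATTERN = re.compile(
    r"Property\s+ID:\s*</span></td>\s*<td.*?><b>(?:&nbsp;)*(.*?)</b></td>",
    re.DOTALL,
)

# First sample: line breaks between tags and a leading &nbsp; in the value.
FIRST = """<tr>
<td valign="top" nowrap><span class="textBold">Property
ID: </span></td>
<td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>&nbsp;01-068-24-64-1024</b></td>
</tr>"""

# Second sample: the "ideal" compact form from the question.
SECOND = """<tr><td valign="top" nowrap><span class="textBold">Property
ID:</span></td><td valign="top" nowrap colspan="4"
bgcolor="#F0F0F0"><b>01-068-24-64-1024</b></td></tr>"""


def extract_property_id(html):
    """Return the text inside <b>...</b> after 'Property ID:', or None."""
    match = PATTERN.search(html)
    return match.group(1).strip() if match else None


print(extract_property_id(FIRST))
print(extract_property_id(SECOND))
```

Both calls yield the same value, since the \s* and (?:&nbsp;)* parts absorb the differences between the two fragments. A character class such as [&nbsp;]* would instead strip any leading &, n, b, s, p or ; characters from the value itself, which is why the group form is used here.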

HTH,

Regards,

Cerebrus.

Mar 27 '06 #3
