HTML And Regular Expression

Ori
Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text that
appears in the page, without the HTML tags.

I know that there is a way to do it with a regular expression.

Does anyone know how to do it?

Please help...

Thanks,

Ori.
Nov 15 '05 #1
Yeah, use regular expressions. ;-)

Identify a unique pattern in the web page. Say this is the stream from the
page:

<html>
<head>
<table>
<tr>
<td> Dentist </td>
<td>Phone</td>
</tr>
</table>
</head>
</html>

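In this case you want to find the second <td> tag and its contents.

/.*<td>.*<td>(.*)<\/td>/

The expression above says: give me all characters until you find a <td>
tag, then give me everything until you find another one. When you do, store
the value between the second <td> tag and the closing </td> tag and put
it in a variable for me. Note that "." does not match a line break by
default, so for a multi-line page like this one you need the engine's
single-line (dot matches newline) option.
That special variable is $1. This is Perl-style syntax; it's a bit different
in M$ land, where .NET exposes the same value as Groups[1] on the Match object.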
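
For what it's worth, here is roughly how that could look in C#. This is only a sketch: the names SecondCellDemo and SecondCell are made up for illustration, the quantifiers are swapped for lazy ones (.*?) so the group stops at the end of the second cell even when the row has more cells, and RegexOptions.Singleline is added on the assumption that the page contains line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondCellDemo
{
    // Hypothetical helper: returns the text of the second <td> cell,
    // or null if the input has fewer than two cells.
    public static string SecondCell(string html)
    {
        // Skip the first cell, then capture the contents of the next one.
        // Singleline lets "." match the line breaks between the tags.
        Match m = Regex.Match(
            html,
            @"<td>.*?</td>.*?<td>(.*?)</td>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Groups[1] is the .NET counterpart of $1.
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        string html = "<table>\n<tr>\n<td> Dentist </td>\n<td>Phone</td>\n</tr>\n</table>";
        Console.WriteLine(SecondCell(html)); // prints "Phone"
    }
}
```

The pattern only knows about plain <td> tags; cells with attributes (for example <td class="x">) would need something like <td[^>]*> instead.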

Hope this helps.
Nick Harris, MCSD
http://www.VizSoft.net
Nov 15 '05 #2
Regular expression quantifiers are greedy by default. If you try to get everything
between the open and close tags, the engine will look for the last closing
tag and grab everything in between. ".*" will consume as much as it can
before returning, so be careful with ".*". For example, if you have
<td>dsafdsfsf</td><td>fdsafdsfs</td>, then a regular expression like
<td>.*</td> will match the entire content, even though you would
logically want two matches.

.NET has non-greedy modifiers for its quantifiers. You might want to try
something like "<html>(.*?)</html>"

NOTE: Since most pages will only have one set of HTML tags, you might be
able to use something simple like "<html>(.*)</html>"
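
A quick way to see the greedy versus non-greedy behaviour described above is to count the matches each pattern produces on the two-cell string. This is just a small sketch; GreedyVsLazy and CountMatches are names invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Counts how many times the pattern matches the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string cells = "<td>dsafdsfsf</td><td>fdsafdsfs</td>";

        // Greedy: .* runs on to the last </td>, so both cells form one match.
        Console.WriteLine(CountMatches(cells, "<td>.*</td>"));  // prints 1

        // Lazy: .*? stops at the first </td>, giving one match per cell.
        Console.WriteLine(CountMatches(cells, "<td>.*?</td>")); // prints 2
    }
}
```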

I have not actually tried this to make sure the syntax is correct, but it
should give you an idea. This is all from memory. Also, the .NET
documentation describes all the quantifiers that it supports.
Nov 15 '05 #3
Hi,

If you don't mind losing all formatting, and being left with the blank spaces
and CRLFs that HTML ignores, then you can do it with a regex replace:

using System.Text.RegularExpressions;

string html = "..........";
// Lazily match each "<...>" tag and replace it with nothing.
string text = Regex.Replace(html, "<.*?>", "");
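
If the leftover blank spaces and CRLFs are a problem, one possible extension is sketched below. It is an assumption-laden example, not a full HTML parser: HtmlText and StripTags are made-up names, the pattern is changed to <[^>]*> so that tags split across lines are also removed, each tag is replaced by a space rather than an empty string so neighbouring words stay apart, and WebUtility.HtmlDecode assumes .NET 4 or later.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tags and tidies the leftover whitespace.
    public static string StripTags(string html)
    {
        // [^>] matches newlines too, so tags split across lines are removed.
        // A space is substituted so neighbouring words do not run together.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of spaces, tabs and CR/LFs into a single space.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<tr>\r\n  <td> Dentist </td>\r\n  <td>Phone &amp; Fax</td>\r\n</tr>";
        Console.WriteLine(StripTags(html)); // prints "Dentist Phone & Fax"
    }
}
```

Keep in mind that this still leaves the contents of <script> and <style> blocks in the output, since only the tags themselves are removed.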

HTH,
greetings


Nov 15 '05 #4
