HTML And Regular Expression

Ori
Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text that
appears on the page, without any of the HTML tags.

I know that there is a way to do it with a regular expression.

Does anyone know how to do it?

Please help.

Thanks,

Ori.
Nov 15 '05 #1
Yeah, use regular expressions. ;-)

Identify a unique pattern in the web page. Say this is the stream from the
page:

<html>
<head>
<table>
<tr>
<td> Dentist </td>
<td>Phone</td>
</tr>
</table>
</head>
</html>

In this case you want to find the 2nd <td> tag and its contents.

/.*<td>.*<td>(.*)<\/td>/

The expression above says: skip all characters until you find a <td> tag,
then skip everything until you find another one. When you do, take the value
between that second <td> tag and the </td> tag that closes it, and put it in
a variable for me. That special variable is $1. This is vi syntax; it might
be a bit different in M$ land.
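
In C#, the same idea might look like the sketch below. It is written with top-level statements (C# 9 or later; in an older project the statements would go inside a Main method). The variable names `page` and `secondCell` are made up for illustration. Two things differ from the vi-style version: the capture is read from `Match.Groups[1]` rather than `$1`, and `RegexOptions.Singleline` is passed because `.` does not match line breaks by default, so without it the pattern could not span the two <td> lines.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample markup from the post above; `page` is a hypothetical variable name.
string page = "<html>\n<head>\n<table>\n<tr>\n<td> Dentist </td>\n<td>Phone</td>\n</tr>\n</table>\n</head>\n</html>";

// Singleline lets '.' match newlines, so the pattern can span several lines.
Match m = Regex.Match(page, @".*<td>.*<td>(.*)</td>", RegexOptions.Singleline);

// Groups[1] holds whatever the first pair of parentheses captured.
string secondCell = m.Success ? m.Groups[1].Value : "";
Console.WriteLine(secondCell); // prints: Phone
```

This only works because the sample has exactly two cells. With more cells, the greedy `.*` parts would skip ahead to the last <td> rather than the second one.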

Hope this helps.
Nick Harris, MCSD
http://www.VizSoft.net

Nov 15 '05 #2
Regular expression quantifiers are greedy by default. If you try to get
everything between the open and close tags, the engine will look for the last
closing tag and grab everything in between. ".*" will consume as much as it
can before returning, so be careful with ".*". For example, if you have
<td>dsafdsfsf</td><td>fdsafdsfs</td>, then a regular expression like
<td>.*</td> will match the entire content, even though you would
logically want two matches.

.NET has non-greedy modifiers for its quantifiers. You might want to try
something like "\<html\>(.*?)\</html\>"
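
To see the difference on the <td> example above, here is a small sketch (top-level statements, C# 9 or later; the variable names `row`, `greedy` and `lazy` are made up for illustration). It runs the greedy and the non-greedy pattern over the same string and counts the matches.

```csharp
using System;
using System.Text.RegularExpressions;

string row = "<td>dsafdsfsf</td><td>fdsafdsfs</td>";

// Greedy: ".*" runs to the last </td>, so the whole string is one match.
MatchCollection greedy = Regex.Matches(row, "<td>.*</td>");

// Lazy: ".*?" stops at the first </td> it can reach, one match per cell.
MatchCollection lazy = Regex.Matches(row, "<td>(.*?)</td>");

Console.WriteLine(greedy.Count); // prints: 1
Console.WriteLine(lazy.Count);   // prints: 2

// Groups[1] is the cell text captured by the parentheses.
foreach (Match cell in lazy)
    Console.WriteLine(cell.Groups[1].Value);
```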

NOTE: Since most pages will only have one set of HTML tags, you might be
able to use something simple like "\<html\>(.*)\</html\>"

I have not actually tried this to make sure the syntax is correct, but it
should give you an idea. This is all from memory. Also, the .NET
documentation describes all the quantifiers that it supports.
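
Since the syntax above is untested, here is one way it might be checked in C# (top-level statements, C# 9 or later). The page content and the names `page` and `inner` are invented for the example. The sketch assumes a bare <html> tag with no attributes, and adds `RegexOptions.Singleline` so that `.` can cross line breaks, plus `RegexOptions.IgnoreCase` so that <HTML> is accepted too.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical page content; real pages span many lines.
string page = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>";

// Singleline: '.' also matches '\n'. IgnoreCase: accepts <HTML> as well.
Match m = Regex.Match(page, @"<html>(.*?)</html>",
                      RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Everything between the opening and closing html tags, or "" if no match.
string inner = m.Success ? m.Groups[1].Value : "";
Console.WriteLine(inner.Trim());
```

A page whose opening tag carries attributes, such as <html lang="en">, would not match this pattern as written.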
Nov 15 '05 #3
Hi,

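If you don't mind losing all formatting, and being left with the blank spaces
and CRLFs that HTML ignores, then you can do it with a regex replace:

using System.Text.RegularExpressions;
string html = "..........";
// Lazily match each <...> tag and replace it with an empty string.
string text = Regex.Replace(html, @"<.*?>", "");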
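
If the leftover whitespace is a problem, a slightly longer sketch along the same lines could clean it up. `HtmlToText` is a hypothetical helper name, not a .NET API, and the extra steps are assumptions about what "text only" should mean: script and style blocks are dropped, entities are decoded with `WebUtility.HtmlDecode`, and runs of whitespace are collapsed. It uses top-level statements (C# 9 or later) and is still a regex approximation, not a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the name HtmlToText is an assumption, not a .NET API.
static string HtmlToText(string html)
{
    // Drop <script> and <style> blocks entirely, including their contents.
    string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
                                RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Replace every remaining tag with a space so adjacent words stay apart.
    text = Regex.Replace(text, @"<.*?>", " ", RegexOptions.Singleline);
    // Turn entities such as &amp; back into plain characters.
    text = WebUtility.HtmlDecode(text);
    // Collapse runs of spaces, tabs and line breaks into single spaces.
    return Regex.Replace(text, @"\s+", " ").Trim();
}

string sample = "<p>Fish &amp;\r\n  chips</p><script>var x = 1;</script>";
Console.WriteLine(HtmlToText(sample)); // prints: Fish & chips
```

Tags are replaced with a space rather than an empty string so that text from neighbouring cells or paragraphs does not run together.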

HTH,
greetings


Nov 15 '05 #4
