HTML And Regular Expression

Ori

Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text which
appears in the page, without the HTML tags.

I know that there is a way to do it with a regular expression.

Does someone know how to do it?

Please help….

Thanks,

Ori.

Nov 15 '05 #1

Nick Harris

yeah, use regular expressions. ;-)

Identify a unique pattern in the web page. Say this is the stream from the
page:

<html>
<body>
<table>
<tr>
<td> Dentist </td>
<td>Phone</td>
</tr>
</table>
</body>
</html>

In this case you want to find the 2nd <td> tag and its contents.

/.*<td>.*<td>(.*)<\/td>/

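The expression above says: skip all characters until you find a <td> tag,
then skip everything until you find another one. When you do, store the value
between the second <td> tag and the closing </td> tag and put it in a
variable for me. That special variable is $1. This is Perl-style syntax; it
might be a bit different in M$ land, where the same capture is read from
Groups[1] of the Match object. Note that "." normally does not match a line
break, so for a multi-line page like the sample you need the single-line
option.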
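
For the .NET side, the capture described above could be sketched roughly as follows. This is only an illustration, not the pattern from the post verbatim: it uses lazy quantifiers so that it picks the second cell even when the row has more than two, and the class and method names (SecondCell, Extract) are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class SecondCell
{
    // Hypothetical helper: returns the trimmed text of the second <td> cell,
    // or null when the input has fewer than two cells.
    public static string Extract(string html)
    {
        // Singleline makes '.' match line breaks too, so the pattern can span lines.
        // Each lazy ".*?" stops at the nearest tag that lets the match continue.
        Match m = Regex.Match(
            html,
            @"<td>.*?</td>.*?<td>(.*?)</td>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Groups[1] is the first parenthesised group, the .NET counterpart of $1.
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

Called on the sample page above, SecondCell.Extract(html) should give back "Phone". The pattern assumes plain <td> tags without attributes, as in the sample.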

Hope this helps.
Nick Harris, MCSD
http://www.VizSoft.net

Nov 15 '05 #2

Peter Rilling

Regular expressions are greedy by default. If you try to get everything
between the open and close tags, the engine will look for the last closing
tag and grab everything in between. ".*" will consume as much as it can
before returning, so be careful with ".*". For example, if you have
<td>dsafdsfsf</td><td>fdsafdsfs</td>, then a regular expression like
<td>.*</td> will match the entire content even though you would
logically want two matches.

.NET has non-greedy modifiers for its quantifiers. You might want to try
something like "\<html\>(.*?)\</html\>"

NOTE: Since most pages will only have one set of HTML tags, you might be
able to use something simple like "\<html\>(.*)\</html\>"
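
To see the greedy and non-greedy behaviour side by side, a small helper along these lines can list what each pattern matches. It is a sketch only; GreedyVsLazy and AllMatches are names invented for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: returns every substring of input that the
    // pattern matches, in the order the matches are found.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

With the input <td>dsafdsfsf</td><td>fdsafdsfs</td>, the greedy pattern "<td>.*</td>" yields a single match covering the whole string, while the lazy pattern "<td>.*?</td>" yields two matches, one per cell.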

I have not actually tried this to make sure the syntax is correct, but it
should give you an idea. This is all from memory. Also, the .NET
documentation describes all the quantifiers that it supports.

Nov 15 '05 #3

BMermuys

Hi,

If you don't mind losing all formatting, and keeping the blank spaces and
CR/LFs which are ignored in HTML, then you can do it with a regex replace:

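// requires: using System.Text.RegularExpressions;
string html = "..........";
// Lazily match from each '<' to the nearest '>'; Singleline lets a tag span lines.
string text = Regex.Replace(html, "<.*?>", "", RegexOptions.Singleline);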
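
If the leftover whitespace and entities are a problem, the same idea can be extended a little. The following is a rough sketch rather than a complete HTML-to-text converter: HtmlText and ToPlainText are hypothetical names, and a regex like this can still be fooled by markup such as a '>' inside an attribute value or a comment.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tags, decodes entities and
    // collapses whitespace so only the readable text remains.
    public static string ToPlainText(string html)
    {
        const RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Drop script and style elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b.*?</\1\s*>", " ", opts);
        // Replace every remaining tag with a space so adjacent words stay apart.
        text = Regex.Replace(text, @"<.*?>", " ", opts);
        // Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace (including CR/LF) into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Tags are replaced with a space instead of an empty string so that text from neighbouring cells or paragraphs does not run together; the final step then squeezes the extra spaces out again.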

HTH,
greetings

Nov 15 '05 #4
