HTML And Regular Expression

Ori
Hi, I'm working with C#.NET and I'm looking for the following.

I have the content of a web page and I want to pull out all the text that appears
in the page, without the HTML tags.

I know that there is a way to do it with a regular expression.

Does someone know how to do it?

Please help….

Thanks,

Ori.
Nov 15 '05 #1
yeah, use regular expressions. ;-)

Identify a unique pattern in the web page. Say this is the stream from the
page:

<html>
<head>
<table>
<tr>
<td> Dentist </td>
<td>Phone</td>
</tr>
</table>
</head>
</html>

In this case you want to find the 2nd <td> tag and its contents.

/.*<td>.*<td>(.*)<\/td>/

The expression above says: give me all characters until you find a <td>
tag, then give me everything until you find another one. When you do, store
the value between the second <td> tag and the </td> tag after it and put
it in a variable for me.
That special variable is $1. This is vi syntax; it might be a bit different in M$ land.
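
For the C# side, a minimal sketch of the same idea with System.Text.RegularExpressions might look like the following. The class and method names (SecondCellDemo, SecondCell) are made up for illustration. It is not a literal translation: it uses the lazy .*? instead of the greedy .* and passes RegexOptions.Singleline, because in .NET "." does not match a line break by default and the two cells in the sample sit on separate lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondCellDemo
{
    // Hypothetical helper: returns the trimmed contents of the second <td>
    // in the input, or null when there are fewer than two cells.
    public static string SecondCell(string html)
    {
        // Skip the first cell, then capture what sits inside the next one.
        // Singleline lets '.' cross line breaks; '.*?' stops at the nearest tag.
        Match m = Regex.Match(
            html,
            "<td>.*?</td>.*?<td>(.*?)</td>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Groups[1] plays the role of $1.
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        string html =
            "<table>\n<tr>\n<td> Dentist </td>\n<td>Phone</td>\n</tr>\n</table>";
        Console.WriteLine(SecondCell(html)); // prints "Phone"
    }
}
```

This only holds up while the cells are plain <td> tags with no attributes and no nesting; anything less regular probably calls for a real HTML parser.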

Hope this helps.
Nick Harris, MCSD
http://www.VizSoft.net
Nov 15 '05 #2
Regular expressions are greedy by default. If you try to get everything
between the open and close tags, the engine will look for the last closing
tag and grab everything in between. ".*" will consume as much as it can
before returning, so be careful with ".*". For example, if you have
<td>dsafdsfsf</td><td>fdsafdsfs</td>, then a regular expression like
<td>.*</td> will match the entire content even though you would
logically want two matches.
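
A small sketch of that difference, using a two-cell string like the one above. The class and method names (GreedyVsLazyDemo, CountMatches) are invented for the example; only Regex.Matches is the real .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: counts how many times the pattern matches the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string row = "<td>dsafdsfsf</td><td>fdsafdsfs</td>";

        // Greedy: '.*' runs to the last </td>, so both cells form one match.
        Console.WriteLine(CountMatches(row, "<td>.*</td>"));  // prints 1

        // Lazy: '.*?' stops at the first </td>, so each cell matches on its own.
        Console.WriteLine(CountMatches(row, "<td>.*?</td>")); // prints 2
    }
}
```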

.NET has non-greedy modifiers for its quantifiers. You might want to try
something like "\<html\>(.*?)\</html\>"

NOTE: Since most pages will only have one set of HTML tags, you might be
able to use something simple like "\<html\>(.*)\</html\>". Either way, "."
does not match line breaks unless you pass RegexOptions.Singleline.

I have not actually tried to make sure the syntax is correct, but this
should give you an idea. This is all from memory. Also, the .NET
documentation describes all the quantifiers that they support.
Nov 15 '05 #3
Hi,

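If you don't mind losing all formatting, and being left with the blank spaces and CRLFs that
are ignored in HTML, then you can do it with a regex replace:

using System.Text.RegularExpressions;

string html = "..........";  // the page source goes here
// Replace every tag (the shortest run from "<" to ">") with nothing.
string text = Regex.Replace(html, "\\<.*?\\>", "");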
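
If the leftover whitespace and entities are a problem, a slightly longer sketch along the same lines could look like this. The names HtmlTextDemo and StripTags are invented for the example, and the extra steps (dropping script and style bodies, decoding entities with WebUtility.HtmlDecode, collapsing whitespace) are additions to the one-liner above, not part of it. It is still a regex approximation, so unusual markup may slip through.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextDemo
{
    // Hypothetical helper: returns the visible text of an HTML fragment.
    public static string StripTags(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string s = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Replace each remaining tag with a space so neighbouring words
        // do not run together. [^>] also matches line breaks.
        s = Regex.Replace(s, "<[^>]*>", " ");

        // Turn entities such as &amp; back into plain characters.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of spaces, tabs and line breaks into one space.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html =
            "<html><body><p>Dentist</p><script>var x = 1;</script>" +
            "<p>Phone &amp; fax</p></body></html>";
        Console.WriteLine(StripTags(html)); // prints "Dentist Phone & fax"
    }
}
```

Replacing tags with a space rather than an empty string is a deliberate choice here: with "", text from adjacent cells or paragraphs would be glued together.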

HTH,
greetings


Nov 15 '05 #4
