Parsing / processing a stream of HTML

Hi,

I'm using HttpWebRequest and HttpWebResponse to return a stream of HTML.
Looking for advice as to the accepted / easiest / most efficient way to
process this HTML in the background i.e. I don't want to display it all to
the user, just pull out certain pieces of it.

Specifically, I'm looking to evaluate the tabledefs it contains - walk
through their rows and columns etc.

Any assistance gratefully received as ever.

Best regards,

Mark Rae
Nov 15 '05 #1
Mark,
I would seriously consider using regular expressions to extract the
content you are looking for out of your html string.

http://www.regular-expressions.info/dotnet.html
http://www.ondotnet.com/pub/a/dotnet...11/regex2.html
--
Jay Douglas
Fort Collins, CO

Nov 15 '05 #2
"Jay Douglas" <RE*********************************@squarei.com> wrote in
message news:uT****************@TK2MSFTNGP10.phx.gbl...
> Mark,
> I would seriously consider using regular expressions to extract the
> content you are looking for out of your html string.
>
> http://www.regular-expressions.info/dotnet.html
> http://www.ondotnet.com/pub/a/dotnet...11/regex2.html


Thanks for the reply. Will that, e.g. allow me to extract all the text
between "<table" and "</table>"?

Alternatively, is there a way to reference a stream of HTML and treat it as
if it were an HTML document from which I could evaluate the tabledefs
collection etc?

Mark
Nov 15 '05 #3
Mark,

With regular expressions, you can extract text from all sorts of
different patterns including text in-between table tags.
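
For illustration, here is a minimal sketch of that kind of extraction. It is written in Python rather than .NET so it can be run as-is; the sample `html` string and the name `TABLE_RE` are made up for the example. The pattern is non-greedy, so each match stops at the nearest closing tag, and it assumes the tables are not nested.

```python
import re

# Hypothetical sample input; a real page would come from the HTTP response.
html = """
<p>Intro</p>
<table id="a"><tr><td>1</td><td>2</td></tr></table>
<p>Between</p>
<TABLE><tr><td>3</td></tr></TABLE>
"""

# Non-greedy match from "<table ...>" to the nearest "</table>".
# DOTALL lets "." span newlines; IGNORECASE also accepts <TABLE>.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.DOTALL | re.IGNORECASE)

# findall returns only the captured group, i.e. the text between the tags.
tables = TABLE_RE.findall(html)

for body in tables:
    print(body)
```

The .NET Regex class has comparable options (RegexOptions.Singleline and RegexOptions.IgnoreCase), so the same pattern should carry over with little change.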

Now, about changing attributes and elements of the HTML string: I've
seen some examples where the HTML is transformed into an XML string, the
attributes of certain elements are modified, and the result is converted
back to an HTML string.

Here's a link to start your research with:

http://www.fawcette.com/vsm/2002_03/..._wagner_03_18/
--
Jay Douglas
Fort Collins, CO

Nov 15 '05 #4
"Jay Douglas" <RE*********************************@squarei.com> wrote in
message news:e5****************@TK2MSFTNGP11.phx.gbl...

Jay,

> With regular expressions, you can extract text from all sorts of
> different patterns including text in-between table tags.
>
> Now, about changing attributes and elements of the HTML string: I've
> seen some examples where the HTML is transformed into an XML string, the
> attributes of certain elements are modified, and the result is converted
> back to an HTML string.
>
> Here's a link to start your research with:
>
> http://www.fawcette.com/vsm/2002_03/..._wagner_03_18/


Thanks for this. I looked at it, and found that it was more than I needed.

In the end, I extracted the various <tr>...</tr> lines out of the HTML
stream, and then processed them with the standard Substring() and
IndexOf() methods of the String object.

Job done.
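
For anyone following along, the approach can be sketched roughly as below. This is an illustration in Python, not the code actually used: `str.find` and slicing stand in for IndexOf() and Substring(), and the helper name `extract_between` and the sample `html` string are made up for the example.

```python
def extract_between(text, start_tag, end_tag):
    """Return every substring found between start_tag and end_tag."""
    pieces = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)   # plays the role of IndexOf()
        if start == -1:
            break
        start += len(start_tag)
        end = text.find(end_tag, start)
        if end == -1:
            break
        pieces.append(text[start:end])      # plays the role of Substring()
        pos = end + len(end_tag)
    return pieces


# Hypothetical sample input.
html = (
    "<table>"
    "<tr><td>Alice</td><td>30</td></tr>"
    "<tr><td>Bob</td><td>25</td></tr>"
    "</table>"
)

# First pull out each row, then the cells inside each row.
rows = [
    extract_between(row, "<td>", "</td>")
    for row in extract_between(html, "<tr>", "</tr>")
]
print(rows)
```

Note that this only matches the tags exactly as written, so a row that starts with something like `<tr class="x">` would be skipped.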

Best,

Mark
Nov 15 '05 #5
Jay Douglas wrote:
> Mark,
>
> With regular expressions, you can extract text from all sorts of
> different patterns including text in-between table tags.


Not really. You cannot match corresponding opening and closing tags for
example, because there's no way to express such constructs using regular
expressions (see context-free grammars).
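
A small sketch of where this bites, in Python for the sake of a runnable example. The sample strings are made up, and the patterns are the plain non-greedy and greedy forms, not anything more elaborate.

```python
import re

# A table nested inside a cell of another table (hypothetical input).
nested = "<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>"

# Two sibling tables separated by a paragraph (hypothetical input).
siblings = "<table>A</table><p>x</p><table>B</table>"

# Non-greedy: stops at the FIRST </table>, which belongs to the inner table,
# so the outer table's content is cut short.
lazy_nested = re.search(r"<table>(.*?)</table>", nested, re.DOTALL).group(1)

# Greedy: runs to the LAST </table>, so two separate tables are merged.
greedy_siblings = re.search(r"<table>(.*)</table>", siblings, re.DOTALL).group(1)

print(lazy_nested)
print(greedy_siblings)
```

Neither pattern keeps track of how many tables are currently open, which is exactly what pairing up nested tags requires.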

I'd rather use a real parser such as Chris Lovett's SGML parser.

http://www.gotdotnet.com/Community/U...4-C3BD760564BC
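
To give an idea of what the parser route looks like, here is a rough sketch using the `html.parser` module from Python's standard library. It is a stand-in chosen so the example runs as-is, not the SGML parser linked above, and the class name `TableCollector` and the sample input are made up. A stack of open tables is what lets it cope with nesting.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect cell text as tables -> rows -> cells, using a stack for nesting."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order their </table> is seen
        self._stack = []   # tables currently open, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"rows": [], "cell": None})
        elif tag == "tr" and self._stack:
            self._stack[-1]["rows"].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]["rows"]:
            self._stack[-1]["cell"] = []   # start collecting text for this cell

    def handle_data(self, data):
        # Text belongs to the open cell of the innermost open table, if any.
        if self._stack and self._stack[-1]["cell"] is not None:
            self._stack[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        current = self._stack[-1]
        if tag in ("td", "th") and current["cell"] is not None:
            current["rows"][-1].append("".join(current["cell"]).strip())
            current["cell"] = None
        elif tag == "table":
            self.tables.append(self._stack.pop()["rows"])


# Hypothetical sample input with one table nested inside another.
html = (
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td><td>b</td></tr></table>"
)

parser = TableCollector()
parser.feed(html)
print(parser.tables)
```

The inner table is reported first because its closing tag is seen first; the text of the outer cell is still attributed to the outer table.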

Cheers,

--
Joerg Jooss
jo*********@gmx.net

Nov 15 '05 #6
"Joerg Jooss" <jo*********@gmx.net> wrote in message
news:eC**************@TK2MSFTNGP11.phx.gbl...
> Jay Douglas wrote:
>> Mark,
>>
>> With regular expressions, you can extract text from all sorts of
>> different patterns including text in-between table tags.
>
> Not really. You cannot match corresponding opening and closing tags for
> example, because there's no way to express such constructs using regular
> expressions (see context-free grammars).


I'm having no problems thus far extracting strings between the following
tags:

<tr>...</tr>
<td>...</td>
<p>...</p>

> I'd rather use a real parser such as Chris Lovett's SGML parser.
>
> http://www.gotdotnet.com/Community/U...4-C3BD760564BC

Very useful!

Mark
Nov 15 '05 #7
