ASP Question: Parse HTML file?

Hi all,

I'm working on a project with just under 1300 course files, all of them
HTML. My problem is that I need to do more with the content of
these pages, and the thought of writing 1300 ASP pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

I need to keep the content for the Title, Summary, Topics and Technical
Requirements, and lose the Copyright and Terms of Use. In addition I need to
squeeze in a new section which will display pricing information and a link
to "Add to cart" etc.

My "plan" (if you can call it that) was to have 1 asp page which can parse
the appropriate HTML file based on the asp page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).
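
A minimal sketch of the lookup step in that plan, under some loudly stated assumptions: the helper name `CourseFileName` is hypothetical, the `.htm` extension is a guess (the post only says the code is the part of the filename before the dot), and course codes are assumed to be purely alphanumeric, like `560c04` in the sample page. Rejecting everything else keeps values such as `..\secret` from ever reaching the file system.

```vbscript
Option Explicit

' Hypothetical helper: turns a course code from the querystring into a file name.
' Returns "" when the code contains anything other than letters and digits.
Function CourseFileName(strCode)
    Dim objRE
    Set objRE = New RegExp
    objRE.Pattern = "^[A-Za-z0-9]+$"
    If objRE.Test(strCode) Then
        ' ".htm" is an assumption; use whatever extension the supplied files have
        CourseFileName = strCode & ".htm"
    Else
        CourseFileName = ""
    End If
End Function

' In an ASP page the code would come from Request.QueryString and the folder
' from Server.MapPath; plain values are used here so the sketch runs under cscript.
WScript.Echo "[" & CourseFileName("560c04") & "]"
WScript.Echo "[" & CourseFileName("..\secret") & "]"
```

An empty result would be the cue to show a "course not found" page instead of opening a file.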

What I then need to do is go through the content of the HTML. This is
where I am currently stuck.

I have pasted an example of one of these pages below. If anyone can suggest
how I might achieve this I would be most grateful. In addition, if
anyone can explain the XML namespace stuff in there that would be handy
too. I figure this is just a normal HTML page, as there is no declaration
or anything at the top?

Any information/suggestions would be most appreciated.

Thanks in advance for your help,

Regards

Rob
Example file:

<html>
<head>
<title>Novell 560 CNE Series: File System</title>
<meta name="Description" content="">
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
</head>
<body class="MlCatPage">
<table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="Logo" colspan="2">
<img class="Logo" src="../images/logo.gif">
</td>
</tr>
<tr>
<td class="Title">
<div class="ProductTitle">
<span class="CoCat">Novell 560 CNE Series: File System</span>
</div>
<div class="ProductDetails">
<span class="SmallText">
<span class="BoldText"> Product Code: </span>
560c04<span class="BoldText"> Time: </span>
4.0 hour(s)<span class="BoldText"> CEUs: </span>
Available</span>
</div>
</td>
<td class="Back">
<div class="BackButton">
<a href="javascript:history.back()">
<img src="../images/back.gif" align="right" border="0">
</a>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="sectiontext">Summary:</h3>
</td>
</tr>
<tr>
<td class="Overview">
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<br>
<h3 class="Sectiontext">Objectives:</h3>
<div class="FreeText">After completing this course, students will
be able to: </div>
<div class="ObjectiveList">
<ul class="listing">
<li class="ObjectiveItem">Explain the relationship of the file
system and login scripts</li>
<li class="ObjectiveItem">Create login scripts</li>
<li class="ObjectiveItem">Manage file system directories and
files</li>
<li class="ObjectiveItem">Map network drives</li>
</ul>
</div>
<br></br>
<h3 class="Sectiontext">Topics:</h3>
<div class="OutlineList">
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Volume Space</li>
<li class="OutlineItem">Examining Login Scripts</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
<li class="OutlineItem">Drive Mappings</li>
<li class="OutlineItem">Login Scripts and Resources</li>
</ul>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Technical Requirements:</h3>
</td>
</tr>
<tr>
<td class="Details">
<div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
connection speed, broadband (256 kbps or greater) connection recommended.
Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
supported.</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Copyright Information:</h3>
</td>
</tr>
<tr>
<td class="Copyright">
<div class="ProductRequirements">Product names mentioned in this
catalog may be trademarks/servicemarks or registered trademarks/servicemarks
of their respective companies and are hereby acknowledged. All product
names that are known to be trademarks or service marks have been
appropriately capitalized. Use of a name in this catalog is for
identification purposes only, and should not be regarded as affecting the
validity of any trademark or service mark, or as suggesting any affiliation
between MindLeaders.com, Inc. and the trademark/servicemark
proprietor.</div>
<br>
<h3 class="Sectiontext">Terms of Use:</h3>
<div class="ProductUsenote"></div>
</td>
</tr>
</table>
<p align="center">
<span class="SmallText">Copyright &copy; 2006 MindLeaders. All rights
reserved.</span>
</p>
</body>
</html>
May 18 '06 #1

Rob Meade wrote:
Hi all,

I'm working on a project where there are just under 1300 course files, these
are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use


If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.

--
Mike Brind

May 18 '06 #2

I have pasted an example of one of these pages below - if anyone can suggest to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?


These pages will have been generated via an XSLT transform. The transform
will have made use of these namespaces. However, unless told otherwise,
XSLT will output the xmlns attributes for these namespaces even though no
element belonging to them is output, which is the case here.

That's a long-winded way of saying they don't do anything, so ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML; it
would be a lot easier to do what you need. Still, it's a good sign that the
content of the other 1299 pages is likely to be consistent, so Mike's idea
of scanning with RegExp should work.

Anthony.
May 18 '06 #3
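
Since the namespace declarations carry no meaning in these pages, they can simply be dropped before any further processing. A minimal sketch, assuming the declarations always have the `xmlns:prefix="..."` shape seen in the sample; the helper name `StripNamespaceDecls` is hypothetical. (On the generating side, the usual way to suppress them is the `exclude-result-prefixes` attribute on the stylesheet, but that file belongs to the training company.)

```vbscript
Option Explicit

' Hypothetical helper: removes xmlns:prefix="..." declarations from markup,
' together with the whitespace (including line breaks) in front of them.
Function StripNamespaceDecls(strHtml)
    Dim objRE
    Set objRE = New RegExp
    objRE.Global = True
    objRE.IgnoreCase = True
    objRE.Pattern = "\s+xmlns:\w+=""[^""]*"""
    StripNamespaceDecls = objRE.Replace(strHtml, "")
End Function

Dim strTag
strTag = "<table class=""Header"" xmlns:fo=""http://www.w3.org/1999/XSL/Format""" & vbCrLf & _
         "xmlns:fn=""http://www.w3.org/2005/xpath-functions"">"

' Prints: <table class="Header">
WScript.Echo StripNamespaceDecls(strTag)
```
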
"Rob Meade" wrote ...

[snip]

Consider displaying their page inside of an <iframe>
inside of a page that has your content.

"The iframe element creates an inline frame that contains another document."
http://www.w3schools.com/tags/tag_iframe.asp
May 18 '06 #4
"McKirahan" wrote ...
Consider displaying their page inside of an <iframe>
inside of a page that has your content.


Hi McKirahan,

Thanks for your reply - alas I need "bits" of their pages, with "bits" of my
stuff inserted in between, so including their whole page as-is unfortunately
is no good for me.

Regards

Rob
May 19 '06 #5
"Mike Brind" wrote ...
If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.


Hi Mike,

Thanks for your reply.

I don't suppose you have an example that would get me started with that
approach, do you? It sounds like it could well work.

Regards

Rob
May 19 '06 #6
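
A minimal sketch of the regex approach Mike describes, using the VBScript `RegExp` object that is available in classic ASP. The helper name `GetDivContent` is hypothetical, and the sketch assumes the target div has exactly one `class` attribute and no nested `<div>` inside it, which holds for the sample page but has not been checked against the other files.

```vbscript
Option Explicit

' Hypothetical helper: returns the inner HTML of the first <div> whose class
' attribute equals strClass, or "" when there is no such div.
Function GetDivContent(strHtml, strClass)
    Dim objRE, colMatches
    Set objRE = New RegExp
    objRE.IgnoreCase = True
    objRE.Global = False
    ' [\s\S] matches any character including line breaks; *? stops at the first </div>
    objRE.Pattern = "<div\s+class=""" & strClass & """\s*>([\s\S]*?)</div>"
    Set colMatches = objRE.Execute(strHtml)
    If colMatches.Count > 0 Then
        GetDivContent = Trim(colMatches(0).SubMatches(0))
    Else
        GetDivContent = ""
    End If
End Function

Dim strSample
strSample = "<td class=""Overview"">" & vbCrLf & _
            "<div class=""ProductSummary"">This course provides an introduction" & vbCrLf & _
            "to NetWare 5 file system concepts and management procedures.</div>" & vbCrLf & _
            "</td>"

WScript.Echo GetDivContent(strSample, "ProductSummary")
```

The repeated `ProductRequirements` class needs extra care with this helper: only the first occurrence (Technical Requirements) is returned, which happens to be the one wanted here.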
"Anthony Jones" wrote ...
These pages will have been generated via an XSLT transform. The transform
will have made use of these namespaces. However unless informed otherwise
XSLT will output the xmlns tags for these namespaces even though no element
is output belonging to them which is the case here.

That's a long winded way of saying they don't do anything, ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML it
would be a lot easier to do what you need. Still it's a good sign that the
content of the other 1299 pages are likely to be consistent so Mike's idea
of scanning with RegExp should work.


Hi Anthony,

Thanks for the reply.

I especially appreciate the explanation for why they are there - I tried
googling it last night and found some stuff about XSLT 2.0 but it didn't
really get me anywhere - I would agree that it's a shame they are not as
XML - that would have been nice!

Cheers

Rob
May 19 '06 #7
"Mike Brind" wrote ...
If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.


It would have been nice if each div class were unique.
This one is repeated:
<div class="ProductRequirements">
It's not wrong, just (potentially) inconvenient.

<td class="Details">
<div class="ProductRequirements">200MHz Pentium ...

<td class="Copyright">
<div class="ProductRequirements">Product names ...

Which divs are you interested in?
Here's a script that will extract all the divs into a new file:

Option Explicit
'*
'* Extracts the listed divs from one course page into a text file.
'* Run under Windows Script Host from the folder holding the input file.
'*
Const cVBS = "Novell.vbs"
Const cOT1 = "Novell.htm" '= Input filename
Const cOT2 = "Novell.txt" '= Output filename
Const cDIV = "</div>"
'*
'* Declare Variables
'*
Dim intBEG '= Position in the input where the next search starts
intBEG = 1
Dim arrDIV(9)
arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
arrDIV(1) = "ProductTitle"
arrDIV(2) = "ProductDetails"
arrDIV(3) = "ProductSummary"
arrDIV(4) = "FreeText"
arrDIV(5) = "ObjectiveList"
arrDIV(6) = "OutlineList"
arrDIV(7) = "ProductRequirements" '= 1st occurrence (Technical Requirements)
arrDIV(8) = "ProductRequirements" '= 2nd occurrence (Copyright)
arrDIV(9) = "ProductUsenote"
Dim intDIV
Dim strDIV
Dim strOT1
Dim strOT2
Dim intPOS
Dim intEND
'*
'* Declare Objects
'*
Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
Dim objOT1
Set objOT1 = objFSO.OpenTextFile(cOT1,1)
Dim objOT2
Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
'*
'* Read File, Extract "div", Write Line
'*
strOT1 = objOT1.ReadAll()
For intDIV = 1 To UBound(arrDIV)
    strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))
    '* Search from intBEG so a repeated class finds its next occurrence
    intPOS = InStr(intBEG,strOT1,strDIV)
    If intPOS > 0 Then
        '* Assumes the div holds no nested div (true for the sample page)
        intEND = InStr(intPOS,strOT1,cDIV)
        If intEND > 0 Then
            strOT2 = Mid(strOT1,intPOS,intEND - intPOS + Len(cDIV))
            objOT2.WriteLine strOT2 & vbCrLf
            intBEG = intEND + Len(cDIV)
        End If
    End If
Next
'*
'* Close Files, Destroy Objects
'*
objOT1.Close
objOT2.Close
Set objOT1 = Nothing
Set objOT2 = Nothing
Set objFSO = Nothing
'*
'* Done!
'*
MsgBox "Done!",vbInformation,cVBS

You could modify it to loop through a list or folder of files.

Note that each "class=" is in the stylesheet:
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
which you should refer to when using their divs.
May 19 '06 #8
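
Looping over a folder, as suggested above, mostly comes down to enumerating the files with the `FileSystemObject`. A minimal sketch, assuming all course pages sit in one folder and end in `.htm` or `.html`; the helper name `ListCoursePages` is hypothetical and the folder path is a placeholder.

```vbscript
Option Explicit

' Hypothetical helper: returns the full paths of the .htm/.html files in a
' folder, one per line, so each can be fed to the extraction logic in turn.
Function ListCoursePages(objFSO, strFolder)
    Dim objFile, strExt, strList
    strList = ""
    For Each objFile In objFSO.GetFolder(strFolder).Files
        strExt = LCase(objFSO.GetExtensionName(objFile.Name))
        If strExt = "htm" Or strExt = "html" Then
            strList = strList & objFile.Path & vbCrLf
        End If
    Next
    ListCoursePages = strList
End Function

Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
' "." (the current folder) stands in for the folder holding the course pages
WScript.Echo ListCoursePages(objFSO, ".")
```

Splitting the result on `vbCrLf` gives an array of paths to pass, one at a time, to whatever does the per-file extraction.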
"McKirahan" wrote ...

Hi McKirahan, thank you again for your reply and example.

I should add that I won't be writing these out to another file. Instead it'll
need to happen on the fly, i.e. take the original source page matching the code
passed in the URL, read in the appropriate parts, and then spit out my own
layout and extra parts.

With the example you posted, does it extract what's between the DIV
tags, i.e. the <tr>s and <td>s as well, or just the actual text?

Thanks again

Rob
PS: The copyright one can be excluded.
PPS: When I say it's going to happen on the fly, this obviously depends
on how quick and efficient it is. If it turns out to be too slow for the
number of hits the site in question gets, then I might need some kind of
"import" process (which would probably make more sense anyway). This could
then create new pages, or perhaps store the information in the database.

May 19 '06 #9
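
On the markup-versus-text question: a script that copies everything from the opening `<div ...>` to the closing `</div>` keeps any tags in between. If only the text is wanted, the tags can be removed afterwards. A minimal sketch; the helper name `StripTags` is hypothetical, and it assumes no `>` characters appear inside attribute values.

```vbscript
Option Explicit

' Hypothetical helper: removes anything that looks like a tag and trims
' the leading and trailing spaces from what is left.
Function StripTags(strHtml)
    Dim objRE
    Set objRE = New RegExp
    objRE.Global = True
    objRE.Pattern = "<[^>]+>"
    StripTags = Trim(objRE.Replace(strHtml, ""))
End Function

' Prints: Volume Space
WScript.Echo StripTags("<li class=""OutlineItem"">Volume Space</li>")
```
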
"Rob Meade" wrote ...
With the example you posted (below) - does it extract whats between the DIV
tags, ie the <tr>'s and <td's> as well, or just the actually "text"?


Did you try it as-is to see what you get?

I would probably put all 1300 files (pages) in a single folder.
Then run a process against each to generate 1300 new files in
a different folder. These would be posted for quick access.

Prior to posting they could be reviewed for accuracy.

Also, instead of extracting the divs you could just identify
where you want your stuff inserted.
May 19 '06 #10
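
A minimal sketch of the "identify where you want your stuff inserted" idea: find a marker in the page and splice the new block in front of it. The helper name `InsertBefore`, the `Pricing` class and the `cart.asp` link are all hypothetical; the marker used here is the opening of the Legal table from the sample page.

```vbscript
Option Explicit

' Hypothetical helper: inserts strBlock immediately before the first occurrence
' of strMarker; returns the page unchanged when the marker is absent.
Function InsertBefore(strHtml, strMarker, strBlock)
    Dim intPOS
    intPOS = InStr(1, strHtml, strMarker, vbTextCompare)
    If intPOS > 0 Then
        InsertBefore = Left(strHtml, intPOS - 1) & strBlock & Mid(strHtml, intPOS)
    Else
        InsertBefore = strHtml
    End If
End Function

Dim strPage, strPricing
strPage = "<table class=""Details""></table>" & vbCrLf & _
          "<table class=""Legal""></table>"
' Placeholder block; the real price would come from your own data
strPricing = "<div class=""Pricing""><a href=""cart.asp?code=560c04"">Add to cart</a></div>" & vbCrLf

WScript.Echo InsertBefore(strPage, "<table class=""Legal""", strPricing)
```

Removing the Copyright and Terms of Use section would be a separate step, for example cutting from the same marker to the matching `</table>`.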
