
May 18th, 2006, 09:35 PM
ASP Question: Parse HTML file?
Hi all,
I'm working on a project where there are just under 1300 course files, these
are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.
The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy it's going to be to
parse the page.
Typically there are the following "sections" of each page:
Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use
I need to get the content for the Title, Summary, Topics, Technical
Requirements and lose the Copyright and Terms of use...in addition I need to
squeeze in a new section which will display pricing information and a link
to "Add to cart" etc....
My "plan" (if you can call it that) was to have 1 asp page which can parse
the appropriate HTML file based on the asp page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).
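A minimal sketch of that single-page idea, assuming the code arrives in a querystring parameter named code and the files live in a /courses folder (both names are assumptions, and CourseFileFor is a hypothetical helper, not an existing API). It only turns a code into a file name, and it rejects anything that is not purely letters and digits so the value cannot be used to reach other paths.

```vbscript
Option Explicit

' Returns the HTML file name for a course code, or "" when the code
' contains anything other than letters and digits (e.g. "..\secret").
Function CourseFileFor(strCode)
    Dim re
    Set re = New RegExp
    re.Pattern = "^[a-z0-9]+$"
    re.IgnoreCase = True
    If re.Test(strCode) Then
        CourseFileFor = strCode & ".html"
    Else
        CourseFileFor = ""
    End If
    Set re = Nothing
End Function

' Possible use inside an ASP page (parameter and folder are assumptions):
'   strFile = CourseFileFor(CStr(Request.QueryString("code")))
'   If strFile <> "" Then strPath = Server.MapPath("/courses/" & strFile)
```

Keeping the check in one small function means the same rule can be reused by an import script as well as by the ASP page.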
What I then need to do is go through the content of the HTML....this is
where I am currently stuck....
I have pasted an example of one of these pages below - if anyone can suggest
to me how I might achieve this I would be most grateful - in addition - if
anyone can explain the XML Name Space stuff in there that would be handy
too - I figure this is just a normal HTML page, as there is no declaration
or anything at the top?
Any information/suggestions would be most appreciated.
Thanks in advance for your help,
Regards
Rob
Example file:
<html>
<head>
<title>Novell 560 CNE Series: File System</title>
<meta name="Description" content="">
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
</head>
<body class="MlCatPage">
<table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="Logo" colspan="2">
<img class="Logo" src="../images/logo.gif">
</td>
</tr>
<tr>
<td class="Title">
<div class="ProductTitle">
<span class="CoCat">Novell 560 CNE Series: File System</span>
</div>
<div class="ProductDetails">
<span class="SmallText">
<span class="BoldText"> Product Code: </span>
560c04<span class="BoldText"> Time: </span>
4.0 hour(s)<span class="BoldText"> CEUs: </span>
Available</span>
</div>
</td>
<td class="Back">
<div class="BackButton">
<a href="javascript:history.back()">
<img src="../images/back.gif" align="right" border="0">
</a>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="sectiontext">Summary:</h3>
</td>
</tr>
<tr>
<td class="Overview">
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<br>
<h3 class="Sectiontext">Objectives:</h3>
<div class="FreeText">After completing this course, students will
be able to: </div>
<div class="ObjectiveList">
<ul class="listing">
<li class="ObjectiveItem">Explain the relationship of the file
system and login scripts</li>
<li class="ObjectiveItem">Create login scripts</li>
<li class="ObjectiveItem">Manage file system directories and
files</li>
<li class="ObjectiveItem">Map network drives</li>
</ul>
</div>
<br></br>
<h3 class="Sectiontext">Topics:</h3>
<div class="OutlineList">
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Volume Space</li>
<li class="OutlineItem">Examining Login Scripts</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
<li class="OutlineItem">Drive Mappings</li>
<li class="OutlineItem">Login Scripts and Resources</li>
</ul>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Technical Requirements:</h3>
</td>
</tr>
<tr>
<td class="Details">
<div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
connection speed, broadband (256 kbps or greater) connection recommended.
Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
supported.</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Copyright Information:</h3>
</td>
</tr>
<tr>
<td class="Copyright">
<div class="ProductRequirements">Product names mentioned in this
catalog may be trademarks/servicemarks or registered trademarks/servicemarks
of their respective companies and are hereby acknowledged. All product
names that are known to be trademarks or service marks have been
appropriately capitalized. Use of a name in this catalog is for
identification purposes only, and should not be regarded as affecting the
validity of any trademark or service mark, or as suggesting any affiliation
between MindLeaders.com, Inc. and the trademark/servicemark
proprietor.</div>
<br>
<h3 class="Sectiontext">Terms of Use:</h3>
<div class="ProductUsenote"></div>
</td>
</tr>
</table>
<p align="center">
<span class="SmallText">Copyright © 2006 MindLeaders. All rights
reserved.</span>
</p>
</body>
</html>
May 18th, 2006, 10:25 PM
Re: ASP Question: Parse HTML file?
Rob Meade wrote:
> Hi all,
>
> I'm working on a project where there are just under 1300 course files, these
> are HTML files - my problem is that I need to do more with the content of
> these pages - and the thought of writing 1300 asp pages to deal with this
> doesn't thrill me.
>
> The HTML pages are provided by a training company. They seem to be
> "structured" to some degree, but I'm not sure how easy its going to be to
> parse the page.
>
> Typically there are the following "sections" of each page:
>
> Title
> Summary
> Topics
> Technical Requirements
> Copyright Information
> Terms Of Use
If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.
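A minimal sketch of the regex approach, assuming each wanted div has the exact form <div class="Name"> and contains no nested divs, which holds for the sample page but is not guaranteed for all 1300. DivContent is a hypothetical helper name.

```vbscript
Option Explicit

' Returns the inner HTML of the first <div class="strClass">...</div>
' found in strHtml, or "" when there is no such div.
' Assumes strClass is a plain name and the div holds no nested divs.
Function DivContent(strHtml, strClass)
    Dim re, matches
    Set re = New RegExp
    ' [\s\S]*? matches across line breaks and stops at the first </div>
    re.Pattern = "<div class=""" & strClass & """>([\s\S]*?)</div>"
    re.IgnoreCase = True
    re.Global = False
    Set matches = re.Execute(strHtml)
    If matches.Count > 0 Then
        DivContent = matches(0).SubMatches(0)
    Else
        DivContent = ""
    End If
    Set re = Nothing
End Function
```

For example, DivContent(strHtml, "ProductSummary") would return the summary text of the sample page, ready to be written to a database field.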
--
Mike Brind
May 18th, 2006, 11:05 PM
Re: ASP Question: Parse HTML file?
> I have pasted an example of one of these pages below - if anyone can suggest
> to me how I might achieve this I would be most grateful - in addition - if
> anyone can explain the XML Name Space stuff in there that would be handy
> too - I figure this is just a normal HTML page, as there is no declaration
> or anything at the top?
These pages will have been generated via an XSLT transform. The transform
will have made use of these namespaces. However, unless told otherwise
(for example with exclude-result-prefixes), XSLT will output the xmlns
declarations for these namespaces even though no element belonging to them
is output, which is the case here.
That's a long winded way of saying they don't do anything, ignore them.
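If the leftover declarations get in the way of pattern matching, they can be stripped before parsing. A minimal sketch, assuming they always take the form xmlns:prefix="..." as in the sample page; StripNamespaces is a hypothetical helper name.

```vbscript
Option Explicit

' Removes every xmlns:prefix="..." attribute, including the whitespace
' or line break in front of it, and returns the cleaned HTML.
Function StripNamespaces(strHtml)
    Dim re
    Set re = New RegExp
    re.Pattern = "\s+xmlns:\w+=""[^""]*"""
    re.IgnoreCase = True
    re.Global = True
    StripNamespaces = re.Replace(strHtml, "")
    Set re = Nothing
End Function
```

After this, a tag such as <table class="Header" xmlns:fo="..." xmlns:fn="..."> becomes plain <table class="Header">, so one pattern matches whether or not the declarations were present.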
It's a pity they didn't go the whole hog and output the whole page as XML; it
would be a lot easier to do what you need. Still, it's a good sign that the
content of the other 1299 pages is likely to be consistent, so Mike's idea
of scanning with RegExp should work.
Anthony.
May 18th, 2006, 11:55 PM
Re: ASP Question: Parse HTML file?
"Rob Meade" <robb.meade@NO-SPAM.kingswoodweb.net> wrote in message
news:PS4bg.71560$wl.3777@text.news.blueyonder.co.uk...
[snip]
Consider displaying their page inside of an <iframe>
inside of a page that has your content.
"The iframe element creates an inline frame that contains another document." http://www.w3schools.com/tags/tag_iframe.asp
May 19th, 2006, 08:45 AM
Re: ASP Question: Parse HTML file?
"McKirahan" wrote ...
> Consider displaying their page inside of an <iframe>
> inside of a page that has your content.
Hi McKirahan,
Thanks for your reply - alas I need "bits" of their pages, with "bits" of my
stuff inserted in between, so including their whole page as-is unfortunately
is no good for me.
Regards
Rob
May 19th, 2006, 08:55 AM
Re: ASP Question: Parse HTML file?
"Mike Brind" wrote ...
> If you can identify the specific divs that hold this information (and
> they are consistent across pages), you could use regex to parse the
> files and pop the relevant bits into a database.
Hi Mike,
Thanks for your reply.
I don't suppose by any chance you might have an example that would get me
started with that approach would you - it sounds like it could well work.
Regards
Rob
May 19th, 2006, 08:55 AM
Re: ASP Question: Parse HTML file?
"Anthony Jones" wrote ...
> These pages will have been generated via an XSLT transform. The transform
> will have made use of these namespaces. However unless informed otherwise
> XSLT will output the xmlns tags for these namespaces even though no
> element is output belonging to them which is the case here.
>
> That's a long winded way of saying they don't do anything, ignore them.
>
> It's a pity they didn't go the whole hog and output the whole page as XML
> it would be a lot easier to do what you need. Still it's a good sign that
> the content of the other 1299 pages are likely to be consistent so Mike's
> idea of scanning with RegExp should work.
Hi Anthony,
Thanks for the reply.
I especially appreciate the explanation for why they are there - I tried
googling it last night and found some stuff about XSLT 2.0 but it didn't
really get me anywhere - I would agree that it's a shame they are not as
XML - that would have been nice!
Cheers
Rob
May 19th, 2006, 01:25 PM
Re: ASP Question: Parse HTML file?
"Mike Brind" <paxtonend@hotmail.com> wrote in message
news:1147987306.908257.128550@y43g2000cwc.googlegroups.com...
> If you can identify the specific divs that hold this information (and
> they are consistent across pages), you could use regex to parse the
> files and pop the relevant bits into a database.
It would have been nice if each div class were unique.
This one is repeated:
<div class="ProductRequirements">
It's not wrong, just (potentially) inconvenient.
<td class="Details">
<div class="ProductRequirements">200MHz Pentium ...
<td class="Copyright">
<div class="ProductRequirements">Product names ...
Which divs are you interested in?
Here's a script that will extract all the divs into a new file:
Option Explicit
'*
'* Extracts the listed <div class="..."> blocks from an HTML file and
'* writes each one (tags included) to a text file.
'*
Const cVBS = "Novell.vbs"
Const cOT1 = "Novell.htm" '= Input filename
Const cOT2 = "Novell.txt" '= Output filename
Const cDIV = "</div>"
'*
'* Declare Variables
'*
Dim intBEG '= Position in strOT1 where the next search starts
intBEG = 1
Dim arrDIV(9) '= Class names, in the order they appear in the page
arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
arrDIV(1) = "ProductTitle"
arrDIV(2) = "ProductDetails"
arrDIV(3) = "ProductSummary"
arrDIV(4) = "FreeText"
arrDIV(5) = "ObjectiveList"
arrDIV(6) = "OutlineList"
arrDIV(7) = "ProductRequirements"
arrDIV(8) = "ProductRequirements"
arrDIV(9) = "ProductUsenote"
Dim intDIV
Dim strDIV
Dim strOT1
Dim strOT2
Dim intPOS '= Position of the opening <div> tag in strOT1
Dim intEND '= Position of the next </div> after intPOS
'*
'* Declare Objects
'*
Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
Dim objOT1
Set objOT1 = objFSO.OpenTextFile(cOT1,1)
Dim objOT2
Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
'*
'* Read File, Extract "div", Write Line
'*
strOT1 = objOT1.ReadAll()
For intDIV = 1 To UBound(arrDIV)
    strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))
    intPOS = InStr(intBEG,strOT1,strDIV)
    If intPOS > 0 Then
        intEND = InStr(intPOS,strOT1,cDIV)
        If intEND > 0 Then
            strOT2 = Mid(strOT1,intPOS,intEND - intPOS + Len(cDIV))
            objOT2.WriteLine(strOT2 & vbCrLf)
            '* Resume after this div so a repeated class finds the next one
            intBEG = intEND + Len(cDIV)
        End If
    End If
Next
'*
'* Close Files, Destroy Objects
'*
objOT1.Close
objOT2.Close
Set objOT1 = Nothing
Set objOT2 = Nothing
Set objFSO = Nothing
'*
'* Done!
'*
MsgBox "Done!",vbInformation,cVBS
You could modify it to loop through a list or folder of files.
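A minimal sketch of the folder loop, assuming the files sit in a single folder and end in .htm or .html. ProcessFolder and ProcessFile are hypothetical names, the folder path is a placeholder, and ProcessFile only echoes the path where the extraction logic would go.

```vbscript
Option Explicit

' Calls ProcessFile once for every .htm/.html file in strFolder and
' returns the number of files handled (0 if the folder does not exist).
Function ProcessFolder(strFolder)
    Dim objFSO, objFile, strExt, intCount
    intCount = 0
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If objFSO.FolderExists(strFolder) Then
        For Each objFile In objFSO.GetFolder(strFolder).Files
            strExt = LCase(objFSO.GetExtensionName(objFile.Name))
            If strExt = "htm" Or strExt = "html" Then
                ProcessFile objFile.Path
                intCount = intCount + 1
            End If
        Next
    End If
    Set objFSO = Nothing
    ProcessFolder = intCount
End Function

' Placeholder: the div extraction for one file would go here.
Sub ProcessFile(strPath)
    WScript.Echo "Would process " & strPath
End Sub

' Hypothetical folder; run with cscript so the output goes to the console.
WScript.Echo ProcessFolder("C:\courses") & " file(s) found"
```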
Note that each "class=" is in the stylesheet:
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
which you should refer to when using their divs.
May 19th, 2006, 03:35 PM
Re: ASP Question: Parse HTML file?
"McKirahan" wrote ...
Hi McKirahan, thank you again for your reply and example.
I should add that I won't be writing these out to another file; instead it'll
need to do it on the fly, i.e. take the original source page identified by the
code passed in the URL, read in the appropriate parts, and then spit out my own
layout and extra parts.
With the example you posted (below) - does it extract what's between the DIV
tags, i.e. the <tr>'s and <td>'s as well, or just the actual "text"?
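Reading the posted script, each chunk it writes runs from the opening <div ...> tag to the first closing </div>, so it keeps the div tags and any markup inside them (spans, lists) but not the surrounding <tr> and <td>. If only the text is wanted, the tags can be stripped afterwards. A minimal sketch, with StripTags as a hypothetical helper name; it is a rough regex clean-up, not a full HTML parser.

```vbscript
Option Explicit

' Replaces every tag with a space, collapses runs of whitespace and
' trims the result, leaving only the visible text.
Function StripTags(strHtml)
    Dim re, strText
    Set re = New RegExp
    re.Global = True
    re.Pattern = "<[^>]+>"
    strText = re.Replace(strHtml, " ")
    re.Pattern = "\s+"
    strText = re.Replace(strText, " ")
    Set re = Nothing
    StripTags = Trim(strText)
End Function
```

Tags are replaced with a space rather than nothing so that adjacent list items do not run together.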
Thanks again
Rob
PS: The copyright one can be excluded..
PPS: When I say it's going to happen on the fly, this would obviously depend
on how quick and efficient it is - if it turns out that because of the
number of hits they get on the site in question it's a bit too slow, then I
might have to have some kind of "import" process, which obviously would make
more sense anyway; this could then create new pages, or perhaps store the
information in the database.
> It would have been nice if each div class were unique.
> This one is repeated:
> <div class="ProductRequirements">
> It's not wrong, just (potentially) inconvenient.
>
> Which divs are you interested in?
>
> Here's a script that will extract all the divs into a new file:
>
> [snip]
May 19th, 2006, 10:55 PM
Re: ASP Question: Parse HTML file?
"Rob Meade" <ku.shn.tsews.thbu@edaem.bor> wrote in message
news:e3WJh$0eGHA.3640@TK2MSFTNGP03.phx.gbl...
> With the example you posted (below) - does it extract what's between the DIV
> tags, i.e. the <tr>'s and <td>'s as well, or just the actual "text"?
[snip]
Did you try it as-is to see what you get?
I would probably put all 1300 files (pages) in a single folder.
Then run a process against each to generate 1300 new files in
a different folder. These would be posted for quick access.
Prior to posting they could be reviewed for accuracy.
Also, instead of extracting out the divs you could just identify
where you want your stuff inserted.
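A minimal sketch of that last idea, assuming the unwanted copyright and terms block is always the single <table class="Legal"> seen in the sample page and contains no nested table. SwapLegalForPricing is a hypothetical helper name. It uses InStr rather than RegExp.Replace so that a "$" in the pricing text is not treated as a backreference.

```vbscript
Option Explicit

' Replaces the <table class="Legal" ...>...</table> block with strPricing.
' Returns strHtml unchanged when the block cannot be found.
Function SwapLegalForPricing(strHtml, strPricing)
    Dim intStart, intEnd
    SwapLegalForPricing = strHtml
    intStart = InStr(1, strHtml, "<table class=""Legal""", vbTextCompare)
    If intStart = 0 Then Exit Function
    intEnd = InStr(intStart, strHtml, "</table>", vbTextCompare)
    If intEnd = 0 Then Exit Function
    SwapLegalForPricing = Left(strHtml, intStart - 1) & strPricing & _
                          Mid(strHtml, intEnd + Len("</table>"))
End Function
```

Everything else in the page, including the stylesheet link, is left as the provider wrote it.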
May 20th, 2006, 11:15 AM
Re: ASP Question: Parse HTML file?
"McKirahan" wrote ...
> Did you try it as-is to see what you get?
Hi McKirahan, thanks for your reply.
Not as of yet, no - but I'm home this weekend so will be giving it a go :o)
> I would probably put all 1300 files (pages) in a single folder.
They come in a /courses directory
> Then run a process against each to generate 1300 new files in
> a different folder. These would be posted for quick access.
I think I might have to change the process a bit but the idea is the same -
the content provider has other bits that link to these files, so they'd
still need to be in a /courses directory, but I could put them somewhere
else first, "mangle" them and then spit them out to the /courses directory
:o)
> Prior to posting they could be reviewed for accuracy.
I might check a couple - but not all 1300 - I don't wanna go mental... :oD
> Also, instead of extracting out the divs you could just identify
> where you want your stuff inserted.
Yeah, but there were bits I needed to lose, i.e. the copyright section etc.
I seem to remember a long time back a discussion about transforming pages; I
think it might have been done in an ISAPI filter or something though - not
sure - from what I remember the requested page would get grabbed, actions
happen and then it can be spat out as a different page - I wonder if this is
what the previous company that did this adopted, because I find it hard to
believe they would have created 1300 asp files, and yet all of the links on
the original site were <course-code>.asp as opposed to the real file
<course-code>.html - if you see what I mean...
Regards
Rob
May 20th, 2006, 02:25 PM
Re: ASP Question: Parse HTML file?
"Rob Meade" <ten.bewdoowsgnikNO-SPAM@edaem.bbor> wrote in message
news:cTBbg.72279$wl.54580@text.news.blueyonder.co.uk...
[snip]
> I seem to remember a long time back a discussion about transforming pages, I
> think it might have been done in an ISAPI filter or something though - not
> sure - from what I remember the requested page would get grabbed, actions
> happen and then it can be spat out as a different page - I wonder if this is
> what the previous company that did this adopted, because I find it hard to
> believe they would have created 1300 asp files, but yet all of the links on
> the original site were <course-code>.asp as opposed to the real file
> <course-code>.html - if you see what I mean...
An approach they could have taken was to store the "sections" in a database
table -- one memo field per section -- then generate static pages from it.
Thus, the header, navigation, and footer could be modified independently.
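A minimal sketch of the page-generation half of that approach, assuming the sections have already been read into a Scripting.Dictionary keyed by section name (in practice the values would come from the database fields). BuildPage, the markup it emits, and the cart.asp link are all illustrative assumptions.

```vbscript
Option Explicit

' Builds one HTML page from separately stored sections.
' dictSections maps a section name to its HTML content; sections are
' written in the order they were added to the dictionary.
Function BuildPage(dictSections)
    Dim strKey, strHtml
    strHtml = "<html><body>" & vbCrLf
    For Each strKey In dictSections.Keys
        strHtml = strHtml & "<h3>" & strKey & ":</h3>" & vbCrLf & _
                  "<div>" & dictSections(strKey) & "</div>" & vbCrLf
    Next
    BuildPage = strHtml & "</body></html>"
End Function

Dim dictDemo
Set dictDemo = CreateObject("Scripting.Dictionary")
dictDemo.Add "Summary", "This course provides an introduction to NetWare 5 file system concepts and management procedures."
dictDemo.Add "Pricing", "<a href=""cart.asp?code=560c04"">Add to cart</a>"
WScript.Echo BuildPage(dictDemo)
```

Because each section is stored on its own, dropping the copyright section or adding a pricing section is a matter of which keys go into the dictionary.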
May 21st, 2006, 04:45 PM
Re: ASP Question: Parse HTML file?
"McKirahan" wrote ...
> An approach they could have taken was to store the "sections" in a
> database table -- one memo field per section -- then generate static
> pages from it.
>
> Thus, the header, navigation, and footer could be modified independently.
I suspect the company does have this, but they most likely use it for the
generation of these files which they then sell on etc...
The one thing I do have missing at the moment is a nice file that ties the
<course_code.html> file names (or just the codes) - to the titles of the
courses!
They give you a "contents.html" file which has all of the courses listed and
the codes / files as hyperlinks - but again it would mean parsing the entire
file to get at the goodies, I'm going to ask them if they have the same
thing in XML/Database or something to hopefully make that a bit easier..
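Until the provider supplies something better, the links could be pulled out of contents.html with a pattern. A minimal sketch, assuming each course appears as <a href="code.html">Title</a> with no folder prefix in the href and no markup inside the link text; the real layout of contents.html is not shown here, so the pattern may need adjusting. CourseTitles is a hypothetical helper name.

```vbscript
Option Explicit

' Returns a Scripting.Dictionary mapping course code -> course title
' for every <a href="code.html">Title</a> link found in strHtml.
Function CourseTitles(strHtml)
    Dim re, m, dict
    Set dict = CreateObject("Scripting.Dictionary")
    Set re = New RegExp
    ' Group 1 = file name without ".html", group 2 = the link text
    re.Pattern = "<a\s[^>]*href=""([^""./]+)\.html""[^>]*>([^<]*)</a>"
    re.IgnoreCase = True
    re.Global = True
    For Each m In re.Execute(strHtml)
        dict(m.SubMatches(0)) = Trim(m.SubMatches(1))
    Next
    Set re = Nothing
    Set CourseTitles = dict
End Function
```

The resulting dictionary can be looped over once to fill a codes-and-titles table.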
Thanks again for your help - alas due to my 9 month old son I have yet to
get around to trying your example! But I will :o)
Rob
May 21st, 2006, 09:05 PM
Re: ASP Question: Parse HTML file?
When you do get to try Rob's code, you will see that it opens a number
of possibilities - one of which is to insert the contents of the divs
into a database instead of writing them to 1300 text files. I really
can't understand why this is not at the top of your list of options -
manage 1300 files...? or manage 1? Hmmmm.... But then you obviously
know a lot more about your project than I do :-)
If you were using Rob's code, you can insert this into it:
If intDIV = 2 Then
    Dim re, m, myMatches, pcode
    Set re = New RegExp
    With re
        ' \s* allows any whitespace or line breaks between </span> and the code
        .Pattern = "Product Code: </span>\s*([a-z0-9]{6})"
        .IgnoreCase = True
        .Global = True
    End With
    Set myMatches = re.Execute(strOT2)
    For Each m In myMatches
        ' SubMatches(0) is the captured six-character code
        pcode = m.SubMatches(0)
        Response.Write pcode 'or write to db (WScript.Echo outside ASP)
    Next
    Set re = Nothing
End If
And that will return the Product Code on its own. Change the pattern
to "<title>([^<]*)</title>", run it against the whole file (strOT1), and
SubMatches(0) gives you the title stripped out too.
--
Mike Brind
May 22nd, 2006, 08:05 AM
Re: ASP Question: Parse HTML file?
"Mike Brind" wrote ...
> When you do get to try Rob's code, you will see that it opens a number
> of possibilities - one of which is to insert the contents of the divs
> into a database instead of writing them to 1300 text files.
> [snip]
Hi Mike,
Thanks for your reply - something else to try with it - very much
appreciated, thank you.
Regards
Rob
PS: It's McKirahan's code ;o)