  #1  
July 24th, 2005, 12:56 AM
mark126@gmail.com
Flatten a collection of HTML files into one

If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)

For example, suppose I have 4 files like so (I'm not showing the HTML
tags):

toc.html
-------
Table of Contents
Article 1
[link to article1.html]
Article 2
[link to article2.html]


article1.html
-----------
This is article 1. It is very short.

article2.html
-----------
This is article 2. It contains a tip.
[link to tip.html]

tip.html
-------
Don't put things in your ears.


I want to be able to process toc.html and end up with an HTML file like
this:

newtoc.html
----------
Table of Contents
Article 1
This is article 1. It is very short.
Article 2
This is article 2. It contains a tip.
Don't put things in your ears.


In my case, the nesting goes a few levels deeper. Some articles have 20
categories of tips, each category page has one or more tips linked by
the title of the tip. I would like to retain the HTML markup (it is
just very simple headings, paragraphs, italics, etc.) so that I can
process the combined file with html2latex and end up with a nice
looking PDF I can print and read away from a computer.

I am sure I am not the first person wanting to do something like this,
but so far I have not been able to come up with the right input to
Google to find a premade tool.

Mark

  #2  
July 24th, 2005, 12:56 AM
Els
Re: Flatten a collection of HTML files into one

mark126@gmail.com wrote:
> If this is not the right group for this question, please advise me of a
> better one.
>
> I have a collection of simple HTML files, many of which just contain a
> paragraph or two of text. Some contain just an IMG and a one line
> caption. I would like to find a tool that will load the first page
> (call it toc.html) and anywhere it finds a link, it should replace the
> link with the BODY of the linked page. If that BODY contains further
> links, it should be similarly processed. (I don't care if the file is
> processed recursively or if I have to run several passes of the tool.)
>
> For example, suppose I have 4 files like so (I'm not showing the HTML
> tags):
>
> toc.html
> -------
> Table of Contents
> Article 1
> [link to article1.html]
> Article 2
> [link to article2.html]
>
>
> article1.html
> -----------
> This is article 1. It is very short.
>
> article2.html
> -----------
> This is article 2. It contains a tip.
> [link to tip.html]
>
> tip.html
> -------
> Don't put things in your ears.
>
>
> I want to be able to process toc.html and end up with an HTML file like
> this:
>
> newtoc.html
> ----------
> Table of Contents
> Article 1
> This is article 1. It is very short.
> Article 2
> This is article 2. It contains a tip.
> Don't put things in your ears.
>
>
> In my case, the nesting goes a few levels deeper. Some articles have 20
> categories of tips, each category page has one or more tips linked by
> the title of the tip. I would like to retain the HTML markup (it is
> just very simple headings, paragraphs, italics, etc.) so that I can
> process the combined file with html2latex and end up with a nice
> looking PDF I can print and read away from a computer.
>
> I am sure I am not the first person wanting to do something like this,
> but so far I have not been able to come up with the right input to
> Google to find a premade tool.

Do you really want to use links, as in <a href=...>, or are you just
looking for a method? If you have PHP installed on the server, you can
just call these other files like so: <?php include "filename.ext" ?>

Any server-side language will do, though, and there are also Server Side
Includes (SSI), which I know nothing about :-)

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Dirty Frank
  #3  
July 24th, 2005, 12:56 AM
mark126@gmail.com
Re: Flatten a collection of HTML files into one

I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.

Mark

  #4  
July 24th, 2005, 12:56 AM
Els
Re: Flatten a collection of HTML files into one

mark126@gmail.com wrote:
> I already have the files, and they do use links. I didn't make them
> this way -- I'm just trying to make the best of a bad situation. I
> think they were probably made this way because they are from the early
> 1990s and were probably accessed over slow dialup connections.
>
> I forgot to mention in my first post that I am using Mac OS X, but I
> have access to Linux, Windows, and DOS platforms too.

I have a good memory, and I do remember my post of half an hour ago,
and even yours (although not literally). But not everybody who sees
your message has seen or remembered the previous one in the thread.
So, please quote the relevant bits of the post you are replying to,
and reply underneath.

Back to your question: am I understanding you correctly, that
basically you want to change a bunch of regular links to in-page
anchor links?

I think that with some careful thought and a regex combined with
includes, you can actually do that. Personally I'd call a friend with
programming skills to do it for me ;-)

If you just want to have them all in one file, as if the links were
replaced by the files, then why not just replace
<a href="pageX.html">Page X</a> with <?php include "pageX.html" ?> ?
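
The replacement suggested above can be automated rather than done by hand. What follows is a minimal sketch in Python, assuming the links are simple hand-written anchors with quoted, relative hrefs ending in .htm or .html. The names LINK_RE and links_to_includes are made up for illustration.

```python
import re

# Matches a simple anchor such as <a href="pageX.html">Page X</a>.
# Assumption: quoted, relative href ending in .htm/.html, no nested anchors.
# The negative lookahead skips absolute URLs (anything starting "scheme:").
LINK_RE = re.compile(
    r'<a\s+[^>]*?href\s*=\s*(["\'])(?![a-z][a-z0-9+.-]*:)(?P<href>[^"\'#?]+\.html?)\1[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_includes(html: str) -> str:
    """Replace each local link with a PHP include of the linked file."""
    return LINK_RE.sub(
        lambda m: '<?php include "%s" ?>' % m.group("href"), html
    )
```

The rewritten page has to be served through PHP (for example, saved with a .php extension), and each included file still brings its own HTML and HEAD wrapper along, so this is only a rough starting point.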

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Alive (live)
  #5  
July 24th, 2005, 12:56 AM
Andy Dingley
Re: Flatten a collection of HTML files into one

On 24 Mar 2005 11:39:36 -0800, mark126@gmail.com wrote:
> I would like to find a tool that will load the first page
> (call it toc.html) and anywhere it finds a link, it should replace the
> link with the BODY of the linked page.

This is one reason I always author as XHTML. This would be pretty
easy with XSLT.

You should be able to do it with Perl, or most other scripting
languages. The ease of doing it depends on how "clean" the original
code is.
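
For pages that really are well-formed XHTML, a similar tree transformation can be written with an XML parser in a scripting language. The following is only a sketch using Python's standard xml.etree.ElementTree, not XSLT itself. It assumes every file parses as XML (no unclosed tags, no named entities such as &nbsp;) and that hrefs are relative file paths; the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local(tag: str) -> str:
    """Drop a namespace prefix like '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def inline_links(path: Path, seen: frozenset = frozenset()) -> ET.Element:
    """Parse one XHTML file; replace each local <a href> with the linked body."""
    root = ET.parse(path).getroot()
    seen = seen | {path.resolve()}
    # ElementTree has no parent pointers, so map each child to its parent.
    parents = {child: parent for parent in root.iter() for child in parent}
    anchors = [e for e in root.iter() if local(e.tag) == "a" and e.get("href")]
    for a in anchors:
        target = path.parent / a.get("href")
        if not target.is_file() or target.resolve() in seen:
            continue  # external, missing, or circular link: keep it as a link
        linked = inline_links(target, seen)  # linked page, already flattened
        body = next(e for e in linked.iter() if local(e.tag) == "body")
        body.tag = a.tag[:-1] + "div"  # reuse the linked <body> as a <div>
        body.attrib.clear()
        body.tail = a.tail             # keep the text that followed the link
        parent = parents[a]
        parent[list(parent).index(a)] = body
    return root
```

Each link is replaced by a div holding the linked page's body, so a link sitting inside a paragraph would produce invalid nesting. `ET.tostring(root, encoding="unicode")` serializes the result.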

  #6  
July 24th, 2005, 12:56 AM
Pierre Goiffon
Re: Flatten a collection of HTML files into one

Andy Dingley wrote:
> This is one reason I always author as XHTML. This would be pretty
> easy with XSLT.

And this could also be done very easily (I should say, more easily
:o) ) using HTML and any programming language that offers a regexp or
DOM API, and there are a lot of them.
  #7  
July 24th, 2005, 12:56 AM
Nick Kew
Re: Flatten a collection of HTML files into one

mark126@gmail.com wrote:
> I would like to find a tool that will load the first page
> (call it toc.html) and anywhere it finds a link, it should replace the
> link with the BODY of the linked page.

What you describe is the third example on the page describing markup
macros in mod_publisher. At its simplest you'd use

MLMacro a replace url @href

to replace all <a ...> links with the contents of a page referenced in
the href attribute.

Don't forget that if you're inserting HTML, you need to preprocess it
to remove everything that isn't body contents. To do that you'd apply
several macros to the included page:

MLMacro html replace start ""
MLMacro html replace end ""
MLMacro head hide
MLMacro body replace start "<div class=\"included\">"
MLMacro body replace end </div>

If you're processing badly broken markup, you might also need
to apply MLExtendedFixups. But for anything half-decent, the
above should be sufficient.

http://apache.webthing.com/mod_publisher/
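
Outside Apache, the effect of the five macros above on an included page can be approximated in a few lines of a scripting language. This Python sketch is not mod_publisher code and does not reproduce its parser. It is a regex approximation that assumes reasonably tidy markup, and the function name is hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL


def as_included_fragment(html: str) -> str:
    """Strip everything that is not body content and wrap it in a div."""
    html = re.sub(r"<!DOCTYPE[^>]*>", "", html, flags=FLAGS)
    # html start/end tags are replaced with nothing
    html = re.sub(r"</?html\b[^>]*>", "", html, flags=FLAGS)
    # the whole head element is hidden
    html = re.sub(r"<head\b[^>]*>.*?</head\s*>", "", html, flags=FLAGS)
    # body start/end tags become a wrapping div
    html = re.sub(r"<body\b[^>]*>", '<div class="included">', html, flags=FLAGS)
    html = re.sub(r"</body\s*>", "</div>", html, flags=FLAGS)
    return html.strip()
```

A page that omits the BODY tags altogether comes back without the wrapping div, so badly broken markup needs more than this.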

--
Nick Kew
  #8  
July 24th, 2005, 12:56 AM
Fuzzyman
Re: Flatten a collection of HTML files into one


mark126@gmail.com wrote:
> If this is not the right group for this question, please advise me of a
> better one.
>
> I have a collection of simple HTML files, many of which just contain a
> paragraph or two of text. Some contain just an IMG and a one line
> caption. I would like to find a tool that will load the first page
> (call it toc.html) and anywhere it finds a link, it should replace the
> link with the BODY of the linked page. If that BODY contains further
> links, it should be similarly processed. (I don't care if the file is
> processed recursively or if I have to run several passes of the tool.)
>

This would definitely be (reasonably) easy in most modern scripting
languages. My personal favourite is Python. There is a Python HTML
parser called BeautifulSoup that would make this almost trivial.

Are the links relative (i.e. should they be loaded from a filesystem)
or absolute URLs (loaded from the internet)?

I've already used BeautifulSoup to write a link checker that crawls all
URLs within a single domain. See
http://www.voidspace.org.uk/python/programs.shtml

I can hack it over the weekend to do what you need... You'll need to
wait until Tuesday though - I'm off the internet until then. The first
version will fetch files from a local filesystem and just insert the
contents of the BODY tag (recursively) instead of the link. If you want
any additional processing we can discuss it.
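
The first version described above (local files, BODY contents inserted recursively in place of each link) might look roughly like the sketch below. It is not Fuzzyman's script and it does not use BeautifulSoup: to stay dependency-free it uses the standard library's html.parser instead. It assumes every page has explicit BODY tags, that hrefs are relative file paths, and that the files are readable as UTF-8; the class and function names are invented for illustration.

```python
from html import escape
from html.parser import HTMLParser
from pathlib import Path


class BodyFlattener(HTMLParser):
    """Copy one page's BODY, swapping each local link for the linked BODY."""

    def __init__(self, path: Path, seen: frozenset):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.path = path
        self.seen = seen | {path.resolve()}  # pages currently being expanded
        self.out = []                        # pieces of the flattened body
        self.in_body = False
        self.in_replaced_link = False

    def emit(self, text: str) -> None:
        # Only copy what lies inside <body>, and skip replaced link text.
        if self.in_body and not self.in_replaced_link:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if tag == "a" and self.in_body:
            target = self.path.parent / (dict(attrs).get("href") or "")
            if target.is_file() and target.resolve() not in self.seen:
                self.emit(flatten(target, self.seen))
                self.in_replaced_link = True  # drop the link text and </a>
                return
        self.emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "a" and self.in_replaced_link:
            self.in_replaced_link = False
        else:
            self.emit(f"</{tag}>")

    def handle_data(self, data):
        self.emit(escape(data, quote=False))


def flatten(path: Path, seen: frozenset = frozenset()) -> str:
    """Return the flattened BODY contents of `path` as an HTML string."""
    parser = BodyFlattener(path, seen)
    # Assumed encoding; pages from the early 1990s may need "latin-1" instead.
    parser.feed(path.read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return "".join(parser.out)
```

Calling `flatten(Path("toc.html"))` returns the combined body, which still has to be wrapped in an HTML skeleton before it is handed to html2latex. Links to pages that are already being expanded are left as links so that circular references cannot recurse forever, and comments are dropped.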

Regards,

Fuzzy
http://www.voidspace.org.uk/python

[snip..]

  #9  
July 24th, 2005, 01:00 AM
Fuzzyman
Re: Flatten a collection of HTML files into one

I'm still happy to do this, if it's actually needed.

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

 
