Flatten a collection of HTML files into one

If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)

For example, suppose I have 4 files like so (I'm not showing the HTML
tags):

toc.html
-------
Table of Contents
Article 1
[link to article1.html]
Article 2
[link to article2.html]

article1.html
-----------
This is article 1. It is very short.

article2.html
-----------
This is article 2. It contains a tip.
[link to tip.html]

tip.html
-------
Don't put things in your ears.

I want to be able to process toc.html and end up with an HTML file like
this:

newtoc.html
----------
Table of Contents
Article 1
This is article 1. It is very short.
Article 2
This is article 2. It contains a tip.
Don't put things in your ears.

In my case, the nesting goes a few levels deeper. Some articles have 20
categories of tips, each category page has one or more tips linked by
the title of the tip. I would like to retain the HTML markup (it is
just very simple headings, paragraphs, italics, etc.) so that I can
process the combined file with html2latex and end up with a nice
looking PDF I can print and read away from a computer.

I am sure I am not the first person wanting to do something like this,
but so far I have not been able to come up with the right input to
Google to find a premade tool.

Mark

Jul 23 '05 #1
Els
Do you really want to use links, as in <a href=...>, or are you just
looking for a method? If you have PHP installed on the server, you can
just call these other files like so: <?php include "filename.ext" ?>

Any server-side language will do, though, and then there are Server Side
Includes (SSI) (which I know nothing about :-) )

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Dirty Frank
Jul 23 '05 #2
I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.

Mark

Jul 23 '05 #3
Els
ma*****@gmail.com wrote:
I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.


I have a good memory, and I do remember my post of half an hour ago,
and even yours (although not literally). But not everybody who sees
your message has seen or remembered the previous one in the thread.
So, please quote the relevant bits of the post you are replying to,
and reply underneath.

Back to your question: am I understanding you correctly, that
basically you want to change a bunch of regular links to in-page
anchor links?

I think with a bit of good thinking and some regex in combination with
includes you can actually do that. Personally I'd call a friend with
programming skills to do it for me ;-)

If you just want to have them all in one file, as if the links were
replaced by the files, then why not just replace
<a href="pageX.html">Page X</a> with <?php include "pageX.html" ?> ?
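
That replacement could be scripted rather than done by hand. A minimal sketch in Python, assuming every link is a plain double-quoted <a href="..."> that points at a relative .html file. The name links_to_includes is made up for illustration, and a regex like this will miss unusual markup:

```python
import re

# Matches a simple anchor such as <a href="pageX.html">Page X</a>.
# Assumption: href is double-quoted and is a relative .html path
# (the ":" exclusion skips absolute URLs such as http://...).
ANCHOR_RE = re.compile(
    r'<a\s+[^>]*?href="([^":]+\.html)"[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def links_to_includes(html: str) -> str:
    """Replace each matching link with a PHP include of its target."""
    return ANCHOR_RE.sub(lambda m: '<?php include "%s" ?>' % m.group(1), html)
```

The rewritten pages would then need to be served through PHP for the includes to take effect.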

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Alive (live)
Jul 23 '05 #4
On 24 Mar 2005 11:39:36 -0800, ma*****@gmail.com wrote:
I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.


This is one reason I always author as XHTML. This would be pretty
easy with XSLT.

You should be able to do it with Perl, or most other scripting
languages. The ease of doing it depends on how "clean" the original
code is.
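
Python's standard library has no XSLT processor, so the following is not XSLT, but it illustrates the same point: once the pages are well-formed XHTML, an XML parser can do the splice. A rough sketch with xml.etree.ElementTree, assuming the pages carry no xmlns declaration and no HTML-only entities such as &nbsp;. The names inline_links and pages (a dictionary from file name to source text, standing in for files on disk) are hypothetical:

```python
import xml.etree.ElementTree as ET


def inline_links(elem, pages, seen=frozenset()):
    """Replace each <a href> below elem with a <div> holding the target's body.

    pages maps a file name to its XHTML source (a stand-in for reading files).
    seen holds the names already being expanded, so link cycles stop.
    """
    for index, child in enumerate(list(elem)):
        href = child.get("href") if child.tag == "a" else None
        if href in pages and href not in seen:
            # Assumption: every page has <body> as a direct child of <html>.
            body = ET.fromstring(pages[href]).find("body")
            inline_links(body, pages, seen | {href})
            div = ET.Element("div", {"class": "included"})
            div.text = body.text
            div.extend(list(body))
            div.tail = child.tail      # keep any text that followed the link
            elem[index] = div
        else:
            inline_links(child, pages, seen)
    return elem
```

Parsing the top page with ET.fromstring, passing its root to inline_links, and serialising with ET.tostring(root, encoding="unicode") gives the combined document.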

Jul 23 '05 #5
Andy Dingley wrote:
This is one reason I always author as XHTML. This would be pretty
easy with XSLT.


And this could also be done very easily (I should say, in an easier way
:o) ) using HTML and any programming language that includes a RegExp or
DOM API, and there are a lot of them.
Jul 23 '05 #6
ma*****@gmail.com wrote:
I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.


What you describe is the third example on the page describing markup
macros in mod_publisher. At its simplest you'd use

MLMacro a replace url @href

to replace all <a ...> links with the contents of a page referenced in
the href attribute.

Don't forget that if you're inserting HTML, you need to preprocess it
to remove everything that isn't body contents. To do that you'd apply
several macros to the included page:

MLMacro html replace start ""
MLMacro html replace end ""
MLMacro head hide
MLMacro body replace start "<div class=\"included\">"
MLMacro body replace end </div>

If you're processing badly broken markup, you might also need
to apply MLExtendedFixups. But for anything half-decent, the
above should be sufficient.
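
Outside Apache, the same preprocessing (drop everything but the body contents and wrap them in a div) can be approximated in a few lines of Python. This is a regex sketch, not mod_publisher. The name body_as_div is made up, and it assumes each page has at most one BODY element:

```python
import re

# Captures whatever sits between <body ...> and </body>, in any letter case.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_as_div(html: str) -> str:
    """Return the BODY contents of a page wrapped in <div class="included">."""
    match = BODY_RE.search(html)
    # Assumption: a page without an explicit <body> tag is used whole.
    inner = match.group(1) if match else html
    return '<div class="included">' + inner.strip() + "</div>"
```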

http://apache.webthing.com/mod_publisher/

--
Nick Kew
Jul 23 '05 #7

ma*****@gmail.com wrote:
If this is not the right group for this question, please advise me of a better one.

I have a collection of simple HTML files, many of which just contain a paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)


This would definitely be (reasonably) easy in most modern scripting
languages. My personal favourite is Python. There is a Python HTML
parser called Beautiful Soup that would make this almost trivial.

Are the links relative (i.e. should they be loaded from a filesystem)
or absolute URLs (loaded from the internet) ?

I've already used BeautifulSoup to write a link checker that crawls all
URLs within a single domain. See
http://www.voidspace.org.uk/python/programs.shtml

I can hack it over the weekend to do what you need... You'll need to
wait until Tuesday though - I'm off the internet until then. The first
version will fetch files from a local filesystem and just insert the
contents of the BODY tag (recursively) instead of the link. If you want
any additional processing we can discuss it.
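
A rough sketch of what such a first version might look like (not the actual script, which does not appear in this thread). It uses the standard-library html.parser rather than BeautifulSoup so that it runs without extra installs. Assumptions: links are relative paths to files on disk, every page has an explicit BODY tag, and the files can be read as UTF-8. The names flatten and _Flattener are made up:

```python
import os
from html.parser import HTMLParser


class _Flattener(HTMLParser):
    """Re-emit the markup inside <body>, expanding local links in place."""

    def __init__(self, base_dir, seen):
        super().__init__(convert_charrefs=False)
        self.base_dir, self.seen = base_dir, seen
        self.out = []          # pieces of output markup
        self.in_body = False   # True between <body> and </body>
        self.skipping = False  # True inside a link that was expanded

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body and not self.skipping:
            href = dict(attrs).get("href") if tag == "a" else None
            path = os.path.normpath(os.path.join(self.base_dir, href)) if href else None
            if path and os.path.isfile(path) and path not in self.seen:
                self.out.append(flatten(path, self.seen))
                self.skipping = True   # drop the link text up to </a>
            else:
                self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.in_body and not self.skipping:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "a" and self.skipping:
            self.skipping = False
        elif self.in_body and not self.skipping:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body and not self.skipping:
            self.out.append(data)

    def handle_entityref(self, name):
        self.handle_data("&%s;" % name)

    def handle_charref(self, name):
        self.handle_data("&#%s;" % name)


def flatten(path, seen=frozenset()):
    """Return the BODY contents of path with local links expanded recursively."""
    path = os.path.normpath(path)
    parser = _Flattener(os.path.dirname(path), seen | {path})
    with open(path, encoding="utf-8", errors="replace") as handle:
        parser.feed(handle.read())
    parser.close()
    return "".join(parser.out)
```

Calling flatten("toc.html") returns the combined body markup, which can be wrapped in a minimal html/body shell and written to newtoc.html. A link back to a page that is already being expanded is left as an ordinary link, so cyclic links do not recurse forever.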

Regards,

Fuzzy
http://www.voidspace.org.uk/python

[snip..]

Jul 23 '05 #8
I'm still happy to do this, if it's actually needed.

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

Jul 24 '05 #9
