Flatten a collection of HTML files into one

mark126

If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)

For example, suppose I have 4 files like so (I'm not showing the HTML
tags):

toc.html
-------
Table of Contents
Article 1
[link to article1.html]
Article 2
[link to article2.html]

article1.html
-----------
This is article 1. It is very short.

article2.html
-----------
This is article 2. It contains a tip.
[link to tip.html]

tip.html
-------
Don't put things in your ears.

I want to be able to process toc.html and end up with an HTML file like
this:

newtoc.html
----------
Table of Contents
Article 1
This is article 1. It is very short.
Article 2
This is article 2. It contains a tip.
Don't put things in your ears.

In my case, the nesting goes a few levels deeper. Some articles have 20
categories of tips, and each category page has one or more tips linked by
the title of the tip. I would like to retain the HTML markup (it is
just very simple headings, paragraphs, italics, etc.) so that I can
process the combined file with html2latex and end up with a
nice-looking PDF I can print and read away from a computer.

I am sure I am not the first person wanting to do something like this,
but so far I have not been able to come up with the right input to
Google to find a premade tool.

Mark

Jul 23 '05 #1


Els

ma*****@gmail.com wrote:

I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.

[snip..]

Do you really want to use links, as in <a href=...>, or are you just
looking for a method? If you have PHP installed on the server, you can
just call these other files like so: <?php include "filename.ext" ?>

Any serverside language will do though, and then there's Server Side
Includes (SSI) (which I know nothing about :-) )

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Dirty Frank

Jul 23 '05 #2

mark126

I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.

Mark

Jul 23 '05 #3

Els

ma*****@gmail.com wrote:

I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.

I have a good memory, and I do remember my post of half an hour ago,
and even yours (although not literally). But not everybody who sees
your message has seen or remembered the previous one in the thread.
So, please quote the relevant bits of the post you are replying to,
and reply underneath.

Back to your question: am I understanding you correctly, that
basically you want to change a bunch of regular links to in-page
anchor links?

I think with a bit of good thinking and some regex in combination with
includes you can actually do that. Personally I'd call a friend with
programming skills to do it for me ;-)

If you just want to have them all in one file, as if the links were
replaced by the files, then why not just replace
<a href="pageX.html">Page X</a> with <?php include "pageX.html" ?> ?
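
For anyone who wants to automate that substitution, here is a minimal Python sketch. The function name links_to_includes is made up for illustration, and the pattern assumes simple anchors with double-quoted href values; hand-written markup from the early 1990s may need a looser pattern. Note also that a PHP include pulls in the whole target file, HEAD and all, so the included pages would still need trimming down to their BODY contents.

```python
import re

# Matches a simple anchor such as <a href="pageX.html">Page X</a>.
# Assumption: href values are double-quoted and anchors are not nested.
ANCHOR = re.compile(r'<a\s+[^>]*?href="([^"]+)"[^>]*>.*?</a>',
                    re.IGNORECASE | re.DOTALL)


def links_to_includes(html: str) -> str:
    """Replace each anchor with a PHP include of the file it points to."""
    return ANCHOR.sub(
        lambda m: '<?php include "%s" ?>' % m.group(1),
        html,
    )
```

The link text ("Page X") is dropped, which matches the request to replace the link with the linked page.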

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Alive (live)

Jul 23 '05 #4

Andy Dingley

On 24 Mar 2005 11:39:36 -0800, ma*****@gmail.com wrote:

I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.

This is one reason I always author as XHTML. This would be pretty
easy with XSLT.

You should be able to do it with Perl, or most other scripting
languages. The ease of doing it depends on how "clean" the original
code is.

Jul 23 '05 #5

Pierre Goiffon

Andy Dingley wrote:

This is one reason I always author as XHTML. This would be pretty
easy with XSLT.

And this could also be done very easily (I should say, more easily
:o) ) using HTML and any programming language that includes a RegExp or
DOM API, and there are a lot of them.

Jul 23 '05 #6

Nick Kew

ma*****@gmail.com wrote:

I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.

What you describe is the third example on the page describing markup
macros in mod_publisher. At its simplest you'd use

MLMacro a replace url @href

to replace all <a ...> links with the contents of a page referenced in
the href attribute.

Don't forget that if you're inserting HTML, you need to preprocess it
to remove everything that isn't body contents. To do that you'd apply
several macros to the included page:

MLMacro html replace start ""
MLMacro html replace end ""
MLMacro head hide
MLMacro body replace start "<div class=\"included\">"
MLMacro body replace end </div>

If you're processing badly broken markup, you might also need
to apply MLExtendedFixups. But for anything half-decent, the
above should be sufficient.
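
The same preprocessing step (drop everything outside BODY, then wrap what is left in a div) can be sketched outside Apache as well. The Python below is only an illustration of the idea, not mod_publisher's implementation; the function name body_as_div is hypothetical, and it assumes each page has at most one BODY element.

```python
import re

# Captures everything between <body ...> and </body>, in any letter case.
BODY = re.compile(r'<body\b[^>]*>(.*?)</body\s*>', re.IGNORECASE | re.DOTALL)


def body_as_div(html: str, css_class: str = "included") -> str:
    """Return only the BODY contents of a page, wrapped in a <div>."""
    match = BODY.search(html)
    # Fall back to the whole text for fragments that have no <body> tag.
    inner = match.group(1) if match else html
    return '<div class="%s">%s</div>' % (css_class, inner.strip())
```

Attributes on the original BODY tag (bgcolor and the like) are discarded, since they would not be valid on the wrapping div anyway.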

http://apache.webthing.com/mod_publisher/

--
Nick Kew

Jul 23 '05 #7

Fuzzyman

ma*****@gmail.com wrote:

If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)

This would definitely be (reasonably) easy in most modern scripting
languages. My personal favourite is Python. There is a Python HTML
parser called BeautifulSoup that would make this almost trivial.

Are the links relative (i.e. should they be loaded from a filesystem)
or absolute URLs (loaded from the internet) ?

I've already used BeautifulSoup to write a link checker that crawls all
URLs within a single domain. See
http://www.voidspace.org.uk/python/programs.shtml

I can hack it over the weekend to do what you need. You'll need to
wait until Tuesday though - I'm off the internet until then. The first
version will fetch files from a local filesystem and just insert the
contents of the BODY tag (recursively) instead of the link. If you want
any additional processing we can discuss it.
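
A rough sketch of what such a first version could look like is below. It is not the script offered above: to stay dependency-free it swaps BeautifulSoup for standard-library regular expressions, and the names flatten and body_of are invented for illustration. It assumes relative links with quoted href values and no #fragment; any link that is external, missing on disk, or would loop back to a page already being expanded is left untouched.

```python
import re
from pathlib import Path

BODY = re.compile(r'<body\b[^>]*>(.*?)</body\s*>', re.IGNORECASE | re.DOTALL)
# Assumption: href values are quoted and carry no #fragment.
LINK = re.compile(r"""<a\s+[^>]*?href\s*=\s*["']([^"'#]+)["'][^>]*>.*?</a>""",
                  re.IGNORECASE | re.DOTALL)


def body_of(path: Path) -> str:
    """Return the contents of the BODY element, or the whole file if none."""
    text = path.read_text(encoding="utf-8", errors="replace")
    match = BODY.search(text)
    return match.group(1) if match else text


def flatten(path: Path, seen: frozenset = frozenset()) -> str:
    """Return the BODY of `path` with each local link replaced, recursively,
    by the flattened BODY of the page it points to."""
    path = path.resolve()
    seen = seen | {path}  # pages currently being expanded on this branch

    def inline(match):
        href = match.group(1)
        target = (path.parent / href).resolve()
        # Keep the link as-is if it is external, missing, or would loop.
        if "://" in href or target in seen or not target.is_file():
            return match.group(0)
        return flatten(target, seen)

    return LINK.sub(inline, body_of(path))
```

To produce the combined file, something like Path("newtoc.html").write_text("<html><body>" + flatten(Path("toc.html")) + "</body></html>") should do. A page linked from several places is inlined at each place; only circular links are skipped.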

Regards,

Fuzzy
http://www.voidspace.org.uk/python

[snip..]

Jul 23 '05 #8

Fuzzyman

I'm still happy to do this, if it's actually needed.

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

Jul 24 '05 #9
