Flatten a collection of HTML files into one

If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)

For example, suppose I have 4 files like so (I'm not showing the HTML
tags):

toc.html
-------
Table of Contents
Article 1
[link to article1.html]
Article 2
[link to article2.html]

article1.html
-----------
This is article 1. It is very short.

article2.html
-----------
This is article 2. It contains a tip.
[link to tip.html]

tip.html
-------
Don't put things in your ears.

I want to be able to process toc.html and end up with an HTML file like
this:

newtoc.html
----------
Table of Contents
Article 1
This is article 1. It is very short.
Article 2
This is article 2. It contains a tip.
Don't put things in your ears.

In my case, the nesting goes a few levels deeper. Some articles have 20
categories of tips, each category page has one or more tips linked by
the title of the tip. I would like to retain the HTML markup (it is
just very simple headings, paragraphs, italics, etc.) so that I can
process the combined file with html2latex and end up with a nice
looking PDF I can print and read away from a computer.

I am sure I am not the first person wanting to do something like this,
but so far I have not been able to come up with the right input to
Google to find a premade tool.

Mark

Jul 23 '05 #1
Els
Do you really want to use links, as in <a href=...>, or are you just
looking for a method? If you have PHP installed on the server, you can
just call these other files like so: <?php include "filename.ext" ?>

Any serverside language will do though, and then there's Server Side
Includes (SSI) (which I know nothing about :-) )

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Dirty Frank
Jul 23 '05 #2
I already have the files, and they do use links. I didn't make them
this way -- I'm just trying to make the best of a bad situation. I
think they were probably made this way because they are from the early
1990s and were probably accessed over slow dialup connections.

I forgot to mention in my first post that I am using Mac OS X, but I
have access to Linux, Windows, and DOS platforms too.

Mark

Jul 23 '05 #3
Els
I have a good memory, and I do remember my post of half an hour ago,
and even yours (although not literally). But not everybody who sees
your message has seen or remembered the previous one in the thread.
So, please quote the relevant bits of the post you are replying to,
and reply underneath.

Back to your question: am I understanding you correctly, that
basically you want to change a bunch of regular links to in-page
anchor links?

I think with a bit of good thinking and some regex in combination with
includes you can actually do that. Personally I'd call a friend with
programming skills to do it for me ;-)

If you just want to have them all in one file, as if the links were
replaced by the files, then why not just replace
<a href="pageX.html">Page X</a> with <?php include "pageX.html" ?> ?
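
A minimal sketch of that search-and-replace in Python, assuming the links are simple, double-quoted, and contain no nested anchors. The name `links_to_includes` is made up for illustration. Note that each included file would still carry its own html and head markup.

```python
import re

# Matches a simple anchor such as <a href="pageX.html">Page X</a>.
# Assumes double-quoted href values; other quoting styles are not handled.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]+)"[^>]*>.*?</a>',
                     re.IGNORECASE | re.DOTALL)


def links_to_includes(html: str) -> str:
    """Replace each local link with a PHP include of the link target."""
    def repl(match: re.Match) -> str:
        target = match.group(1)
        # Absolute URLs cannot be included as local files, so keep the link.
        if "://" in target:
            return match.group(0)
        return f'<?php include "{target}" ?>'

    return LINK_RE.sub(repl, html)


if __name__ == "__main__":
    print(links_to_includes('<p><a href="pageX.html">Page X</a></p>'))
```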

--
Els http://locusmeus.com/
Sonhos vem. Sonhos vão. O resto é imperfeito.
- Renato Russo -
Now playing: Pearl Jam - Alive (live)
Jul 23 '05 #4
On 24 Mar 2005 11:39:36 -0800, ma*****@gmail.com wrote:
I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.


This is one reason I always author as XHTML. This would be pretty
easy with XSLT.

You should be able to do it with Perl, or most other scripting
languages. The ease of doing it depends on how "clean" the original
code is.
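
A rough Python sketch of the clean-markup case, using an XML parser instead of XSLT. It assumes every page is well-formed XML with no default namespace (real XHTML usually declares one, which would need extra handling). The `pages` dict and the function names are hypothetical; the link text is dropped and each included body becomes a `div`.

```python
import xml.etree.ElementTree as ET


def find_body(root):
    """Return the <body> element, or the root itself if there is none."""
    body = root.find("body")
    return body if body is not None else root


def flatten(href, pages, seen=()):
    """Parse pages[href] and return its body with links expanded in place."""
    body = find_body(ET.fromstring(pages[href]))
    expand(body, pages, seen + (href,))
    return body


def expand(parent, pages, seen):
    """Replace each <a href> child that names a known page with that page's body."""
    for index in range(len(parent) - 1, -1, -1):
        child = parent[index]
        href = child.get("href")
        if child.tag == "a" and href in pages and href not in seen:
            included = flatten(href, pages, seen)
            included.tag = "div"         # the included body becomes a div
            included.attrib.clear()      # drop body-only attributes
            included.tail = child.tail   # keep text that followed the link
            parent[index] = included
        else:
            expand(child, pages, seen)


if __name__ == "__main__":
    pages = {
        "toc.html": "<html><body><h1>Table of Contents</h1>"
                    "<a href='tip.html'>Tip</a></body></html>",
        "tip.html": "<html><body><p>Don't put things in your ears.</p></body></html>",
    }
    print(ET.tostring(flatten("toc.html", pages), encoding="unicode"))
```

The `seen` tuple stops a page that links back to one of its ancestors from being expanded forever; such links are simply left in place.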

Jul 23 '05 #5
Andy Dingley wrote:
This is one reason I always author as XHTML. This would be pretty
easy with XSLT.


And this could also be done very easily (I should say, more easily
:o) ) using HTML and any programming language that includes a RegExp or
DOM API, and there are a lot of them.
Jul 23 '05 #6
ma*****@gmail.com wrote:
I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page.


What you describe is the third example on the page describing markup
macros in mod_publisher. At its simplest you'd use

MLMacro a replace url @href

to replace all <a ...> links with the contents of a page referenced in
the href attribute.

Don't forget that if you're inserting HTML, you need to preprocess it
to remove everything that isn't body contents. To do that you'd apply
several macros to the included page:

MLMacro html replace start ""
MLMacro html replace end ""
MLMacro head hide
MLMacro body replace start "<div class=\"included\">"
MLMacro body replace end </div>

If you're processing badly broken markup, you might also need
to apply MLExtendedFixups. But for anything half-decent, the
above should be sufficient.
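
For readers without mod_publisher, the same preprocessing step (keep only the body contents and wrap them in a div) might look roughly like this in Python. This is a sketch, not a translation of the macros above: it assumes each page has at most one body element, and `body_fragment` is a made-up name.

```python
import re

# Captures everything between <body ...> and </body>, in any letter case.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_fragment(html: str, css_class: str = "included") -> str:
    """Return the contents of <body> wrapped in a <div>, dropping html and head."""
    match = BODY_RE.search(html)
    # Very old pages may omit <body>. This fallback keeps the whole text,
    # including any head markup, so such pages need a manual check.
    inner = match.group(1) if match else html
    return f'<div class="{css_class}">{inner.strip()}</div>'


if __name__ == "__main__":
    page = "<html><head><title>Tip</title></head><body><p>Hello</p></body></html>"
    print(body_fragment(page))
```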

http://apache.webthing.com/mod_publisher/

--
Nick Kew
Jul 23 '05 #7

ma*****@gmail.com wrote:
If this is not the right group for this question, please advise me of a
better one.

I have a collection of simple HTML files, many of which just contain a
paragraph or two of text. Some contain just an IMG and a one line
caption. I would like to find a tool that will load the first page
(call it toc.html) and anywhere it finds a link, it should replace the
link with the BODY of the linked page. If that BODY contains further
links, it should be similarly processed. (I don't care if the file is
processed recursively or if I have to run several passes of the tool.)


This would definitely be (reasonably) easy in most modern scripting
languages. My personal favourite is Python. There is a Python HTML
parser called beautiful soup that would almost make this trivial.

Are the links relative (i.e. should they be loaded from a filesystem)
or absolute URLs (loaded from the internet) ?

I've already used BeautifulSoup to write a link checker that crawls all
URLs within a single domain. See
http://www.voidspace.org.uk/python/programs.shtml

I can hack it over the weekend to do what you need. You'll need to
wait until Tuesday though - I'm off the internet until then. The first
version will fetch files from a local filesystem and just insert the
contents of the BODY tag (recursively) instead of the link. If you want
any additional processing we can discuss it.
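
In the meantime, here is a minimal sketch of what such a first version could look like. It is not the promised tool: it uses only the standard library and regular expressions rather than BeautifulSoup, so it only suits simple, regular markup. The function name `flatten`, the latin-1 encoding, and the output file name are assumptions.

```python
import re
from pathlib import Path

BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\'][^>]*>.*?</a>',
                     re.IGNORECASE | re.DOTALL)


def flatten(path: Path, seen: frozenset = frozenset()) -> str:
    """Return the BODY of `path` with each local link replaced by its target's BODY."""
    path = path.resolve()
    text = path.read_text(encoding="latin-1")  # assumed encoding for old pages
    match = BODY_RE.search(text)
    body = match.group(1) if match else text
    seen = seen | {path}

    def expand(link: re.Match) -> str:
        href = link.group(1).split("#")[0]  # ignore any in-page fragment
        if "://" in href:
            return link.group(0)            # remote URL: keep the link
        target = (path.parent / href).resolve()
        # Keep the link for missing files and for pages already being expanded.
        if target in seen or not target.is_file():
            return link.group(0)
        return flatten(target, seen)

    return LINK_RE.sub(expand, body)


if __name__ == "__main__":
    toc = Path("toc.html")
    if toc.is_file():
        combined = flatten(toc)
        Path("newtoc.html").write_text(
            f"<html><body>{combined}</body></html>", encoding="latin-1")
```

One caveat with this approach: relative IMG paths inside an included page are not rewritten, so it works best when all the files sit in one directory.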

Regards,

Fuzzy
http://www.voidspace.org.uk/python

[snip..]

Jul 23 '05 #8
I'm still happy to do this, if it's actually needed.

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

Jul 24 '05 #9
