Flat HTML headers to nested XML sections

I am working on an XSLT stylesheet that transforms HTML into an XML
format that can be imported into FrameMaker. The challenge, it turns
out, is correctly transforming the flat HTML header tags (<h1>, <h2>,
etc.) into nested sections inside the XML. I have made significant
progress, but have run into a roadblock.

Here is an example of my input HTML:

<html><body>
<p>abc abc</p>
<h1 class='header'>A</h1>
<p>A abc abc</p>
<h2 class='header'>B</h2>
<p>B abc abc</p>
<h3 class='header'>C</h3>
<p>Cabc abc</p>
<h2 class='header'>D</h2> <!-- this is missing in the output -->
<p>D abc abc</p> <!-- this is missing in the output -->
<h1 class='header'>E</h1>
<p>E abc abc</p>
</body></html>

Here is an example of the output. You'll notice that the <h2>D</h2>
section and its paragraph are missing.

<?xml version="1.0" encoding="UTF-8"?>
<article>
<title/>
<para>abc abc</para>
<section depth="1" id="A">
<title>A</title>
<para>A abc abc</para>
<section depth="2" id="B">
<title>B</title>
<para>B abc abc</para>
<section depth="3" id="C">
<title>C</title>
<para>C abc abc</para>
</section>
</section>
</section>
<section depth="1" id="E">
<title>E</title>
<para>E abc abc</para>
</section>
</article>

The problem is that my code currently applies templates to all
nodes following a header whose nearest preceding header is that same
header. For this reason, when content follows a header that isn't
its own header (like an <h2> following an <h3>), it doesn't get shown.
What I don't understand is how to fix it. Any help would be much
appreciated. I'm not really an XSL guru, so I'm doing the best I can
to get through this.

Here is the relevant code from my xsl:

<xsl:template match="body">
<article>
<title>
<xsl:value-of select="$docTitle" />
</title>

<xsl:for-each select='child::*[not(preceding-sibling::*[@class="header"])][not(@class="header")]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

<xsl:variable name='depth'
select='substring(name(child::*[@class="header"][1]),2)'/>
<xsl:for-each select='child::*[@class="header"][substring(name(),2)&lt;=$depth]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</article>
</xsl:template>

<xsl:template match="h1 | h2 | h3 | h4 | h5">
<xsl:call-template name="header">
<xsl:with-param name="depth" select="substring(name(),2)"/>
</xsl:call-template>
</xsl:template>

<xsl:template name="header">
<xsl:param name="depth"/>
<section>
<xsl:attribute name="depth">
<xsl:value-of select="$depth"/>
</xsl:attribute>

<xsl:attribute name="id">
<xsl:value-of select="translate(.,' ','')" />
</xsl:attribute>
<title><xsl:value-of select="."/></title>

<xsl:variable name='thisHeader' select='generate-id(.)'/>
<xsl:for-each select='following-sibling::*
[$thisHeader=generate-id(preceding-sibling::*[@class="header"][last()])]
[not(@class="header") or (@class="header" and substring(name(),2)>=$depth)]'>
<xsl:apply-templates select="."/>
</xsl:for-each>

</section>

</xsl:template>
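
One possible fix, offered as a sketch rather than a definitive answer: the selection described above drops D because D's nearest preceding header is C, and C rejects it for being shallower. A header can instead claim (a) the non-header siblings whose nearest preceding header is itself, and (b) the header siblings whose nearest preceding *shallower* header is itself. The root template, the `p` template, the `docTitle` parameter declaration and the variable name `d` are assumptions added to make the example self-contained; it also assumes every header carries class="header" as in the sample, and that the input has no namespace. On a reverse axis such as preceding-sibling, position 1 is the nearest node, so the sketch uses [1] where the posted code uses [last()].

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: docTitle is a global parameter; its definition is not shown in the post -->
  <xsl:param name="docTitle"/>

  <!-- Assumption: start at html/body so stray whitespace text is not copied -->
  <xsl:template match="/">
    <xsl:apply-templates select="html/body"/>
  </xsl:template>

  <xsl:template match="body">
    <article>
      <title><xsl:value-of select="$docTitle"/></title>
      <!-- Content that appears before the first header -->
      <xsl:apply-templates
          select="*[not(@class='header')][not(preceding-sibling::*[@class='header'])]"/>
      <!-- Top-level sections: headers with no shallower header before them -->
      <xsl:for-each select="*[@class='header']">
        <xsl:variable name="d" select="number(substring(name(), 2))"/>
        <xsl:if test="not(preceding-sibling::*[@class='header']
                          [number(substring(name(), 2)) &lt; $d])">
          <xsl:apply-templates select="."/>
        </xsl:if>
      </xsl:for-each>
    </article>
  </xsl:template>

  <xsl:template match="h1 | h2 | h3 | h4 | h5">
    <xsl:variable name="depth" select="number(substring(name(), 2))"/>
    <xsl:variable name="thisHeader" select="generate-id(.)"/>
    <section depth="{$depth}" id="{translate(., ' ', '')}">
      <title><xsl:value-of select="."/></title>
      <!-- (a) Own content: non-headers whose nearest preceding header is this one -->
      <xsl:apply-templates select="following-sibling::*[not(@class='header')]
          [generate-id(preceding-sibling::*[@class='header'][1]) = $thisHeader]"/>
      <!-- (b) Subsections: headers whose nearest preceding shallower header is this one -->
      <xsl:for-each select="following-sibling::*[@class='header']">
        <xsl:variable name="d" select="number(substring(name(), 2))"/>
        <xsl:if test="generate-id(preceding-sibling::*[@class='header']
                          [number(substring(name(), 2)) &lt; $d][1]) = $thisHeader">
          <xsl:apply-templates select="."/>
        </xsl:if>
      </xsl:for-each>
    </section>
  </xsl:template>

  <!-- Assumption: paragraphs map to <para>, as the posted output suggests -->
  <xsl:template match="p">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

With the sample input, D's nearest shallower header is A, so D should come out as a depth-2 section inside A, right after B. The xsl:for-each is there because an XPath 1.0 predicate cannot compare a candidate's depth with the depths of its own preceding siblings in a single expression; the variable `d` carries the candidate's depth into the inner predicate. The approach rescans siblings for every header, so it is quadratic in the number of body children, which may matter for very long documents.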

May 16 '07 #1
3 Replies


CrazyAtlantaGuy wrote:
> I am working on creating an XSLT that transforms Html into an XML
> format that can be imported into Framemaker. The challenge, it turns
> out, is correctly transforming the flat html header tags (<H1>, <H2>,
> etc) into nested sections inside the xml.

This is called encapsulation, and there's a much neater way than writing
XSLT to try and reach-forward-down-the-tree-up-to-but-not-including the
next H1/H2/H3/etc.

1. Run Tidy to make the HTML into well-formed XHTML (tidy -nc -asxml)

2. Write a short script to turn the XHTML back into valid SGML
(remove NETs, namespaces)

3. Apply a DocType Declaration for the ISO 15445 HTML DTD, which
includes a DIV1/DIV2 containment structure, in "preparation" mode
(declare % Preparation as INCLUDE in the internal subset and use
pre-html as the declared root element type)

4. Run osgmlnorm to normalize the document: this adds the missing
markup, switches single quotes to double where possible, etc

<!doctype pre-html
public "ISO/IEC 15445:2000//DTD HyperText Markup Language//EN" [
<!entity % Preparation "include" >
]>
<PRE-HTML>
<HEAD>
<META CONTENT="HTML Tidy for Linux/x86 (vers 1 September 2005), see
www.w3.org" NAME="GENERATOR">
<TITLE></TITLE>
</HEAD>
<BODY>
<P>abc abc</P>
<H1 CLASS="header">A</H1>
<DIV1>
<P>A abc abc</P>
<H2 CLASS="header">B</H2>
<DIV2>
<P>B abc abc</P>
<H3 CLASS="header">C</H3>
<DIV3>
<P>Cabc abc</P>
</DIV3>
</DIV2>
<H2 CLASS="header">D</H2>
<DIV2>
<P>D abc abc</P>
</DIV2>
</DIV1>
<H1 CLASS="header">E</H1>
<DIV1>
<P>E abc abc</P>
</DIV1>
</BODY>
</PRE-HTML>

You can easily mess with the Preparation structure in the DTD if you
don't like the way they did it (I don't).
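
Once the document is normalized like this, the remaining XSLT becomes short, because each header is immediately followed by a DIVn wrapper holding its content. A minimal sketch, assuming the normalized SGML has first been converted to well-formed XML (for example with osx from OpenSP) and that element names stay upper case as in the listing above; the article/section/title/para target names are taken from the original question, and the P template is an assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Root element is PRE-HTML in the listing; */ keeps the sketch root-name agnostic -->
  <xsl:template match="/">
    <article>
      <title><xsl:value-of select="*/HEAD/TITLE"/></title>
      <xsl:apply-templates select="*/BODY/*"/>
    </article>
  </xsl:template>

  <!-- A header Hn opens a section and pulls in the children of the DIVn right after it -->
  <xsl:template match="H1 | H2 | H3 | H4 | H5 | H6">
    <section depth="{substring(name(), 2)}" id="{translate(., ' ', '')}">
      <title><xsl:value-of select="."/></title>
      <xsl:apply-templates select="following-sibling::*[1]
          [name() = concat('DIV', substring(name(current()), 2))]/*"/>
    </section>
  </xsl:template>

  <!-- DIVn wrappers are consumed by the header template, so they emit nothing themselves.
       Assumption: every DIVn directly follows its Hn, as in the listing. -->
  <xsl:template match="DIV1 | DIV2 | DIV3 | DIV4 | DIV5 | DIV6"/>

  <!-- Assumption: paragraphs map to <para> -->
  <xsl:template match="P">
    <para><xsl:apply-templates/></para>
  </xsl:template>

</xsl:stylesheet>
```

No sibling scanning is needed here: the nesting was already decided by the DTD during normalization, so the stylesheet only renames containers.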

///Peter
May 16 '07 #2

You could try adapting something from the XSLT FAQ. Likely candidates
would be
http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
or
http://www.dpawson.co.uk/xsl/sect2/N...tml#d5891e1051

Some of the other examples on that page may also be adaptable to this
question.

(It's always worth checking Dave's page; he has done an excellent job of
collecting useful answers from XSL-List, which is unofficial but has
been in existence since before XSL was a Recommendation and has had
participation by a lot of XSL's architects and implementers. I still try
to keep half an eye on that list, though I must admit I don't watch it
as closely as I should.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
May 17 '07 #3

On May 17, 12:37 am, Joe Kesselman <keshlam-nos...@comcast.net> wrote:
> You could try adapting something from the XSLT FAQ. Likely candidates
> would be http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e424
> or http://www.dpawson.co.uk/xsl/sect2/N4486.html#d5891e1051
>
> Some of the other examples on that page may also be adaptable to this
> question.
>
> (It's always worth checking Dave's page; he has done an excellent job of
> collecting useful answers from XSL-List, which is unofficial but has
> been in existence since before XSL was a Recommendation and has had
> participation by a lot of XSL's architects and implementers. I still try
> to keep half an eye on that list, though I must admit I don't watch it
> as closely as I should.)
>
> --
> () ASCII Ribbon Campaign | Joe Kesselman
> /\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Thanks for the help!

May 22 '07 #4
