XSL for removing words less than 4 letters in a sitemap

I need to transform this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
  </url>
</urlset>

into this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Books/Paths-for-the-extreme-player</loc>
    <news:news>
      <news:keywords>Books, Paths, extreme, player</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
    <news:news>
      <news:keywords>Games, edge, wall</news:keywords>
    </news:news>
  </url>
</urlset>

I mean, I need a template that creates a <news:keywords> tag
containing all the words from the <loc> tag that have more than 3
letters.
Apr 1 '08 #1
Olagato wrote:
> I mean, I need a template that creates a <news:keywords> tag
> containing all the words from the <loc> tag that have more than 3
> letters.

Do you want to use XSLT 2.0 or 1.0?
What about words like 'localhost' or 'index', how do you decide that
those are not taken?

Here is an XSLT 2.0 stylesheet that shows an approach using
the tokenize function:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm"
    version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The result with Saxon 9 when run against your posted input sample (with a
'root' element added and a namespace chosen for the 'news' prefix) is:

<root>
  <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
    <url>
      <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Paths, extreme, player</news:keywords>
      </news:news>
    </url>
    <url>
      <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Games, edge, wall</news:keywords>
      </news:news>
    </url>
  </urlset>
</root>
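
To make the tokenize expression easier to follow, here is a minimal sketch that splits it into three steps and prints one line of keywords per loc element. The variable names (segments, words) and the text output method are illustrative choices, not part of the stylesheet above, and the cut-off of 5 only fits URLs laid out like the samples in this thread.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    version="2.0">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//sm:loc">
      <!-- Step 1: split on '/'. For 'http://host/a/b/Games/The-edge' the
           tokens are 'http:', '', 'host', 'a', 'b', 'Games', 'The-edge',
           so position() gt 5 keeps everything after the fifth token. -->
      <xsl:variable name="segments"
          select="tokenize(normalize-space(.), '/')[position() gt 5]"/>
      <!-- Step 2: split each remaining path segment on '-'. -->
      <xsl:variable name="words"
          select="for $s in $segments return tokenize($s, '-')"/>
      <!-- Step 3: keep words longer than 3 characters, joined by ', '. -->
      <xsl:value-of select="$words[string-length(.) gt 3]" separator=", "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the leading 'http:', the empty token between the two slashes, and the three host and script segments occupy positions 1 to 5, words like 'localhost' or 'index' never reach the length filter.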
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 2 '08 #2
Olagato wrote:
>> Do you want to use XSLT 2.0 or 1.0?
> I'm using XSLT 1.0.
>> What about words like 'localhost' or 'index', how do you decide that those are not taken?
> It's not a problem now. Maybe an expression like the following would do:
> translate(translate(substring-after(sm:loc, 'http://localhost/index.php/index.php/'), '-', ','), '/', ',')
>
> I'm trying your XSL from PHP without success:

PHP only supports XSLT 1.0, so my posted stylesheet, which uses XSLT and XPath
2.0 functionality, does not work with PHP's XSLT processor.
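
For reference, a plain XSLT 1.0 approach is possible with a recursive named template. The following is only a sketch under assumptions: the prefix parameter reuses the fixed URL start from the substring-after() idea quoted above, the template name "keywords" and its parameters are made up for illustration, and the news namespace is the same placeholder as in the 2.0 stylesheet.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm"
    version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: every URL starts with this fixed prefix -->
  <xsl:param name="prefix" select="'http://localhost/index.php/index.php/'"/>

  <!-- Identity template: copy everything unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:call-template name="keywords">
            <!-- Turn '/' into '-' so only one delimiter has to be handled -->
            <xsl:with-param name="text"
                select="translate(substring-after(normalize-space(sm:loc), $prefix), '/', '-')"/>
          </xsl:call-template>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

  <!-- Recursive tokenizer: emits each word longer than 3 characters,
       separated by ', ' -->
  <xsl:template name="keywords">
    <xsl:param name="text"/>
    <xsl:param name="first" select="true()"/>
    <!-- The word before the first '-', or the whole text if there is none -->
    <xsl:variable name="word">
      <xsl:choose>
        <xsl:when test="contains($text, '-')">
          <xsl:value-of select="substring-before($text, '-')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$text"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:variable name="keep" select="string-length($word) &gt; 3"/>
    <xsl:if test="$keep">
      <xsl:if test="not($first)">, </xsl:if>
      <xsl:value-of select="$word"/>
    </xsl:if>
    <!-- Recurse on the rest; $first stays true until a word was emitted -->
    <xsl:if test="contains($text, '-')">
      <xsl:call-template name="keywords">
        <xsl:with-param name="text" select="substring-after($text, '-')"/>
        <xsl:with-param name="first" select="$first and not($keep)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

For a loc value of http://localhost/index.php/index.php/Games/The-edge-of-the-wall this should produce the keywords "Games, edge, wall". A URL that does not start with the assumed prefix makes substring-after() return an empty string, so its news:keywords element stays empty.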
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #3
On 3 Apr, 13:06, Martin Honnen <mahotr...@yahoo.de> wrote:
> PHP only supports XSLT 1.0 so my posted stylesheet using XSLT and XPath
> 2.0 functionality does not work with PHP's XSLT processor.

Implementing your version with XSLT 1.0 functionality seems quite difficult
because of the lack of advanced functions (at least for an XSL
newbie like me). So my only alternative would be to use another XSLT
processor. I'll try Xalan on the server: http://xalan.apache.org/
Any other ideas using XSLT 1.0 would be appreciated.
Apr 3 '08 #4
Olagato wrote:
> Implementing your version with XSLT 1.0 functionality seems quite difficult
> because of the lack of advanced functions (at least for an XSL
> newbie like me). So my only alternative would be to use another XSLT
> processor. I'll try Xalan on the server: http://xalan.apache.org/
> Any other ideas using XSLT 1.0 would be appreciated.

Xalan does not do XSLT 2.0 so if you want to use XSLT 2.0 then try Saxon
(http://saxon.sourceforge.net/) or Gestalt
(http://gestalt.sourceforge.net/) or AltovaXML
(http://www.altova.com/altovaxml.html).

If you want to use PHP then I think PHP supports EXSLT so you could try
to use http://www.exslt.org/str/functions/tokenize/index.html
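
A minimal sketch of how that could look, assuming the processor implements the EXSLT strings module (with PHP this depends on the build; XSLTProcessor::hasExsltSupport() reports it). The prefix parameter is an assumption taken from the substring-after() idea earlier in the thread, and the news namespace is the placeholder used in the first stylesheet.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    extension-element-prefixes="str"
    exclude-result-prefixes="sm"
    version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: every URL starts with this fixed prefix -->
  <xsl:param name="prefix" select="'http://localhost/index.php/index.php/'"/>

  <!-- Identity template: copy everything unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <!-- str:tokenize treats each character of '-/' as a delimiter
               and returns one token element per word -->
          <xsl:for-each
              select="str:tokenize(substring-after(normalize-space(sm:loc), $prefix), '-/')
                      [string-length(.) &gt; 3]">
            <xsl:if test="position() &gt; 1">, </xsl:if>
            <xsl:value-of select="."/>
          </xsl:for-each>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The length filter sits in a predicate on the token node-set, so position() inside the for-each counts only the words that are kept and the comma handling stays simple.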

--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #5
On 3 Apr, 16:45, Martin Honnen <mahotr...@yahoo.de> wrote:
> Xalan does not do XSLT 2.0 so if you want to use XSLT 2.0 then try Saxon
> (http://saxon.sourceforge.net/) or Gestalt
> (http://gestalt.sourceforge.net/) or AltovaXML
> (http://www.altova.com/altovaxml.html).
>
> If you want to use PHP then I think PHP supports EXSLT so you could try
> to use http://www.exslt.org/str/functions/tokenize/index.html

Thank you very much, Martin.
It's now working fine with Altova XML Spy and Saxon 9 as an external XSLT
processor:
http://216.239.59.104/search?q=cache...ient=firefox-a

There are only 2 little issues left:

My XML input is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

Your XSLT 2.0 is:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The output is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

But I need output as defined by the News Sitemap Protocol:
http://www.google.com/support/webmas...y?answer=42738

So there are 2 things left:
1- <lastmod> tags should disappear from the <url> outputs because a
<news:publication_date> tag has already been defined.
2- The xmlns:news namespace declaration should disappear from the <news:news> tags and
be moved to the <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

A good output file would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news>
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news>
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

Any ideas?
Apr 8 '08 #6
Olagato wrote:
> So there are 2 things left:
> 1- <lastmod> tags should disappear from the <url> outputs because a
> <news:publication_date> tag has already been defined.
> 2- The xmlns:news namespace declaration should disappear from the <news:news> tags and
> be moved to the <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

Both are easy adaptations: use the predicate
[not(self::sm:lastmod)] so that the lastmod element is not copied, and use
xsl:namespace to make sure the namespace declaration is created on the root element:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:urlset">
    <xsl:copy>
      <xsl:namespace name="news"
          select="'http://www.google.com/schemas/sitemap-news/0.9'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()[not(self::sm:lastmod)]"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
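
As an alternative to the predicate, lastmod can also be suppressed with an empty template. This is only a sketch of that idiom, reduced to the publication date so that it stays within XSLT 1.0; it leaves out the keywords and the xsl:namespace step, which need the 2.0 stylesheet above.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="1.0">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: lastmod elements produce no output. It wins over
       the identity template because a name pattern has a higher default
       priority than node(). -->
  <xsl:template match="sm:lastmod"/>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <!-- The source lastmod is still readable here even though it
             is not copied to the output -->
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The empty template keeps the apply-templates call in the url template unchanged, which can be easier to maintain if more elements need to be dropped later.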
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 9 '08 #7