XSL for removing words less than 4 letters in a sitemap

I need to transform this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
  </url>
</urlset>

into this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Books/Paths-for-the-extreme-player</loc>
    <news:news>
      <news:keywords>Books, Paths, extreme, player</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
    <news:news>
      <news:keywords>Games, edge, wall</news:keywords>
    </news:news>
  </url>
</urlset>

That is, I need a template that creates a <news:keywords> tag containing
all the words from the <loc> tag that have more than 3 letters.
Apr 1 '08 #1
6 Replies


Olagato wrote:
> That is, I need a template that creates a <news:keywords> tag containing
> all the words from the <loc> tag that have more than 3 letters.

Do you want to use XSLT 2.0 or 1.0?
What about words like 'localhost' or 'index', how do you decide that
those are not taken?

Here is an XSLT 2.0 stylesheet that should show you an approach using
the tokenize function:

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:news="http://example.com/2008/news"
  xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
  exclude-result-prefixes="sm"
  version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything that has no more specific rule -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy each url and append the keywords derived from its loc -->
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The result with Saxon 9 when run against your posted input sample (with a
'root' element added and a namespace chosen for the 'news' prefix) is
<root>
  <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
    <url>
      <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Paths, extreme, player</news:keywords>
      </news:news>
    </url>
    <url>
      <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Games, edge, wall</news:keywords>
      </news:news>
    </url>
  </urlset>
</root>
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 2 '08 #2

Olagato wrote:
>> Do you want to use XSLT 2.0 or 1.0?
> I'm using XSLT 1.0.
>> What about words like 'localhost' or 'index', how do you decide that those are not taken?
> It's not a problem for now. Maybe an expression like the following:
> translate(translate(substring-after(sm:loc, 'http://localhost/index.php/index.php/'), '-', ','), '/', ',')
>
> I'm trying your XSL from PHP without success:

PHP only supports XSLT 1.0, so my posted stylesheet, which uses XSLT and
XPath 2.0 functionality, does not work with PHP's XSLT processor.
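
For readers limited to plain XSLT 1.0, the same filtering can be approximated with a recursive named template. What follows is only a sketch, not something posted in the thread: the template name "keywords" and its parameters "rest" and "emitted" are made up for illustration, the news namespace URI is the placeholder used in the stylesheet above, and the URL prefix is the one from the translate() expression quoted above, so it assumes every <loc> value starts with exactly that prefix.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:news="http://example.com/2008/news"
  xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
  exclude-result-prefixes="sm">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything that has no more specific rule -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:call-template name="keywords">
            <!-- Assumption: fixed URL prefix. '/' is mapped to '-' and a
                 trailing '-' is appended so every word ends with '-'. -->
            <xsl:with-param name="rest"
              select="concat(translate(substring-after(sm:loc,
                      'http://localhost/index.php/index.php/'), '/', '-'), '-')"/>
          </xsl:call-template>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical helper: emits the words longer than 3 characters,
       separated by ", ", consuming one word per recursive call -->
  <xsl:template name="keywords">
    <xsl:param name="rest"/>
    <xsl:param name="emitted" select="false()"/>
    <xsl:if test="contains($rest, '-')">
      <xsl:variable name="word" select="substring-before($rest, '-')"/>
      <xsl:variable name="keep" select="string-length($word) &gt; 3"/>
      <xsl:if test="$keep">
        <xsl:if test="$emitted">, </xsl:if>
        <xsl:value-of select="$word"/>
      </xsl:if>
      <xsl:call-template name="keywords">
        <xsl:with-param name="rest" select="substring-after($rest, '-')"/>
        <xsl:with-param name="emitted" select="$emitted or $keep"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The "emitted" flag only exists so that the separator is written before the second and later kept words; if a <loc> does not start with the assumed prefix, substring-after returns an empty string and the keywords element stays empty.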
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #3

Reimplementing your posted version with only XSLT 1.0 functionality seems
quite difficult because of the lack of advanced functions (at least for an
XSL newbie like me), so my only alternative would be to use an external
XSLT processor. I'll try Xalan on the server: http://xalan.apache.org/
Any other idea using XSLT 1.0 would be appreciated.
Apr 3 '08 #4

Olagato wrote:
> Reimplementing your posted version with only XSLT 1.0 functionality seems
> quite difficult because of the lack of advanced functions (at least for an
> XSL newbie like me), so my only alternative would be to use an external
> XSLT processor. I'll try Xalan on the server: http://xalan.apache.org/
> Any other idea using XSLT 1.0 would be appreciated.

Xalan does not do XSLT 2.0 so if you want to use XSLT 2.0 then try Saxon
(http://saxon.sourceforge.net/) or Gestalt
(http://gestalt.sourceforge.net/) or AltovaXML
(http://www.altova.com/altovaxml.html).

If you want to use PHP then I think PHP supports EXSLT so you could try
to use http://www.exslt.org/str/functions/tokenize/index.html
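
A rough sketch of what the EXSLT route could look like, assuming the processor really does expose str:tokenize (libxslt, which PHP's XSL extension is built on, implements it, but it is worth checking with function-available('str:tokenize')). The news namespace URI is a placeholder, and the fixed URL prefix passed to substring-after is an assumption taken from the expression suggested earlier in the thread. The second argument of str:tokenize is a set of delimiter characters, so '-/' splits on both hyphens and slashes.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:str="http://exslt.org/strings"
  xmlns:news="http://example.com/2008/news"
  xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
  exclude-result-prefixes="sm str">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything that has no more specific rule -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <!-- str:tokenize returns a node-set of token elements;
               keep only the tokens longer than 3 characters -->
          <xsl:for-each
            select="str:tokenize(substring-after(sm:loc,
                    'http://localhost/index.php/index.php/'), '-/')
                    [string-length(.) &gt; 3]">
            <xsl:if test="position() &gt; 1">, </xsl:if>
            <xsl:value-of select="."/>
          </xsl:for-each>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the length filter is applied in the select expression, position() inside the for-each counts only the kept tokens, so the separator never appears before the first keyword.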

--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #5

Thank you very much, Martin
It's now working fine with Altova XMLSpy and Saxon 9 as external XSLT
processors:
http://216.239.59.104/search?q=cache...ient=firefox-a

There are only 2 little issues left:

My XML input is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

Your XSLT 2.0 is:
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
  exclude-result-prefixes="sm"
  version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The output is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

But I need an output like defined by News Sitemap Protocol:
http://www.google.com/support/webmas...y?answer=42738

So there are 2 things left:
1- The <lastmod> tags should disappear from the <url> output, because a
<news:publication_date> tag has already been defined.
2- The xmlns:news namespace declaration should disappear from the
<news:news> tags and be moved to the
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag at the top.

A good output file would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news>
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news>
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

Any ideas?


Apr 8 '08 #6

Olagato wrote:
> So there are 2 things left:
> 1- The <lastmod> tags should disappear from the <url> output, because a
> <news:publication_date> tag has already been defined.
> 2- The xmlns:news namespace declaration should disappear from the
> <news:news> tags and be moved to the
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag at the top.

Both are easy adaptations: use the predicate [not(self::sm:lastmod)] so
that the lastmod element is not copied, and use xsl:namespace to make sure
the namespace declaration is created on the root element:

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
  xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
  exclude-result-prefixes="sm"
  version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Declare the news namespace once, on the root element -->
  <xsl:template match="sm:urlset">
    <xsl:copy>
      <xsl:namespace name="news"
        select="'http://www.google.com/schemas/sitemap-news/0.9'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Copy each url without lastmod, then append the news element -->
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()[not(self::sm:lastmod)]"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
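
Note that xsl:namespace is an XSLT 2.0 instruction. For anyone restricted to XSLT 1.0, a common workaround is to create the root element as a literal result element that carries both declarations, and to drop lastmod with an empty template. The following is only a sketch of those two adjustments under that assumption; it does not generate the news:news element, and whether later news:news elements reuse the root declaration instead of repeating it is up to the processor's serializer.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
  exclude-result-prefixes="sm">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything that has no more specific rule -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Literal result element: its namespace declarations, including
       xmlns:news, are copied to the output root element -->
  <xsl:template match="sm:urlset">
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <xsl:apply-templates select="@* | node()"/>
    </urlset>
  </xsl:template>

  <!-- Empty template: lastmod elements produce no output -->
  <xsl:template match="sm:lastmod"/>

</xsl:stylesheet>
```

An empty template only suppresses output; an expression such as sm:lastmod evaluated elsewhere in the stylesheet still reads the value from the source document.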
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 9 '08 #7
