XSL for removing words less than 4 letters in a sitemap

I need to transform this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
  </url>
</urlset>

into this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Books/Paths-for-the-extreme-player</loc>
    <news:news>
      <news:keywords>Books, Paths, extreme, player</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...e-edge-of-the-wall</loc>
    <news:news>
      <news:keywords>Games, edge, wall</news:keywords>
    </news:news>
  </url>
</urlset>

I mean, I need a template that creates a <news:keywords> tag
containing all the words from the <loc> tag that have more than 3
letters.
Apr 1 '08 #1
Do you want to use XSLT 2.0 or 1.0?
What about words like 'localhost' or 'index', how do you decide that
those are not taken?

Here is an XSLT 2.0 stylesheet that should show you an approach using
the tokenize function:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm"
    version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The result with Saxon 9 when run against your posted input sample (with a
'root' element added and a namespace chosen for the 'news' prefix) is:

<root>
  <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
    <url>
      <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Paths, extreme, player</news:keywords>
      </news:news>
    </url>
    <url>
      <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Games, edge, wall</news:keywords>
      </news:news>
    </url>
  </urlset>
</root>
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 2 '08 #2
Olagato wrote:
>> Do you want to use XSLT 2.0 or 1.0?
> I'm using XSLT 1.0.
>> What about words like 'localhost' or 'index', how do you decide that those are not taken?
> It's not a problem for now. Maybe an expression like the following:
> translate(translate(substring-after(sm:loc, 'http://localhost/index.php/index.php/'), '-', ','), '/', ',')
>
> I'm trying your XSL from PHP without success:

PHP only supports XSLT 1.0, so my posted stylesheet, which uses XSLT and
XPath 2.0 functionality, does not work with PHP's XSLT processor.
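
For readers restricted to XSLT 1.0, the same filtering can be approximated with a recursive named template, since XSLT 1.0 has no tokenize function. The following is a minimal sketch, not a stylesheet posted in this thread. The template name `keywords` and its parameters `text` and `first` are made up for illustration, the URL prefix is borrowed from the substring-after expression quoted above, and the news namespace URI is the same placeholder used in the XSLT 2.0 version.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm">

  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:call-template name="keywords">
            <!-- Assumed URL prefix; '/' is mapped to '-' so one delimiter is enough -->
            <xsl:with-param name="text"
              select="translate(substring-after(sm:loc, 'http://localhost/index.php/index.php/'), '/', '-')"/>
          </xsl:call-template>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical helper: writes the words longer than 3 characters, comma-separated -->
  <xsl:template name="keywords">
    <xsl:param name="text"/>
    <xsl:param name="first" select="true()"/>
    <!-- Appending '-' guarantees that substring-before always finds a delimiter -->
    <xsl:variable name="word" select="substring-before(concat($text, '-'), '-')"/>
    <xsl:variable name="rest" select="substring-after($text, '-')"/>
    <xsl:variable name="keep" select="string-length($word) &gt; 3"/>
    <xsl:if test="$keep">
      <xsl:if test="not($first)">, </xsl:if>
      <xsl:value-of select="$word"/>
    </xsl:if>
    <!-- Recurse on the remainder; 'first' stays true until a word has been written -->
    <xsl:if test="$rest != ''">
      <xsl:call-template name="keywords">
        <xsl:with-param name="text" select="$rest"/>
        <xsl:with-param name="first" select="$first and not($keep)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

For a path such as Games/The-edge-of-the-wall after the assumed prefix, this should produce the keywords Games, edge, wall. A loc value that does not start with that prefix yields an empty news:keywords element, because substring-after then returns an empty string.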
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #3
Implementing your posted version with only XSLT 1.0 functionality seems
quite difficult because of the lack of advanced functions (at least for an
XSL newbie like me). So my only alternative would be to use an external
XSLT processor. I'll try Xalan on the server: http://xalan.apache.org/
Any other idea using XSLT 1.0 will be appreciated.
Apr 3 '08 #4
Xalan does not do XSLT 2.0 so if you want to use XSLT 2.0 then try Saxon
(http://saxon.sourceforge.net/) or Gestalt
(http://gestalt.sourceforge.net/) or AltovaXML
(http://www.altova.com/altovaxml.html).

If you want to use PHP then I think PHP supports EXSLT so you could try
to use http://www.exslt.org/str/functions/tokenize/index.html
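
The EXSLT route could look roughly like this. str:tokenize takes the string to split plus a string of delimiter characters, and returns a node-set of token elements that an ordinary predicate can filter. This is a minimal sketch, assuming the processor (libxslt in PHP's case) provides the EXSLT strings module. The `prefix` variable and its value are assumptions taken from the substring-after expression quoted earlier, and the news namespace URI is the placeholder from the XSLT 2.0 stylesheet.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm str">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed part of each URL that should not contribute keywords -->
  <xsl:variable name="prefix" select="'http://localhost/index.php/index.php/'"/>

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <!-- Split on '-' or '/', then keep tokens longer than 3 characters -->
          <xsl:for-each
            select="str:tokenize(substring-after(sm:loc, $prefix), '-/')[string-length(.) &gt; 3]">
            <xsl:if test="position() &gt; 1">, </xsl:if>
            <xsl:value-of select="."/>
          </xsl:for-each>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

In PHP, XSLTProcessor::hasExsltSupport() reports whether the underlying library was built with EXSLT, which is worth checking before relying on the str namespace.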

--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 3 '08 #5
Thank you very much, Martin.
It's now working fine with Altova XMLSpy and Saxon 9 as external XSLT
processors:
http://216.239.59.104/search?q=cache...ient=firefox-a

There are only 2 little issues left:

My XML input is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

Your XSLT 2.0 is:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The output is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

But I need an output like the one defined by the News Sitemap protocol:
http://www.google.com/support/webmas...y?answer=42738

So there are 2 things left:
1- <lastmod> tags should disappear from the <url> outputs because a
<news:publication_date> tag has already been defined.
2- The xmlns:news namespace declaration should disappear from the
<news:news> tags and move to the
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

A good output file would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas-de-verano-en-España</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news>
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php...site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news>
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, rt</news:keywords>
    </news:news>
  </url>
</urlset>

Any ideas?
Apr 8 '08 #6
Olagato wrote:
> So there are 2 things left:
> 1- <lastmod> tags should disappear from the <url> outputs because a
> <news:publication_date> tag has already been defined.
> 2- The xmlns:news namespace declaration should disappear from the
> <news:news> tags and move to the
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

Both are easy adaptations: use the predicate [not(self::sm:lastmod)] to
skip the lastmod element, and use xsl:namespace to make sure a
namespace declaration is created on the root element:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:urlset">
    <xsl:copy>
      <xsl:namespace name="news"
        select="'http://www.google.com/schemas/sitemap-news/0.9'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()[not(self::sm:lastmod)]"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
            select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                    return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
            separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
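
Note that xsl:namespace is an XSLT 2.0 instruction. Under XSLT 1.0, one common alternative is to create urlset as a literal result element, which carries the stylesheet's in-scope namespace declarations (here xmlns:news) into the output. The following is a minimal sketch of that idea, assuming the sitemap 0.9 input shown earlier in the thread; the keywords part is left out for brevity, and the exact serialization may differ between processors.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Literal result element instead of xsl:copy: the news namespace that is
       in scope in the stylesheet gets declared once, on the root element -->
  <xsl:template match="sm:urlset">
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <xsl:apply-templates select="@* | node()"/>
    </urlset>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <!-- Copy every child except lastmod, whose value moves into news:news -->
      <xsl:apply-templates select="@* | node()[not(self::sm:lastmod)]"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the declaration already sits on urlset, a serializer should not need to repeat xmlns:news on each news:news element.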
--

Martin Honnen
http://JavaScript.FAQTs.com/
Apr 9 '08 #7
