Searching My XML File Using Keyword Searches?

Hi.

I am somewhat new to this and would like some advice.
I want to search my XML file using a "keyword" search and
return results based on "proximity matching". In other words,
since the search string will often not produce a direct match,
the results will be ranked by proximity (50%, 20%, 100%, etc.).

Are there any good examples out there of how to do keyword
searches on XML data? How should I set up my XML file so
that a tag is as likely as possible to match a related
search term?

And finally, how is proximity determined?

I know that this is a heady question, but I am hoping that some
answers will at least put me on the right track, possibly with
links, sample code, or examples.

Thanks in advance.

Nov 10 '06 #1
pbd22 wrote:
I want to search my xml file using "keyword" search and
return results based on "proximity matching"
I don't know of any off-the-shelf code for this purpose, so you may be
stuck with implementing it yourself based on basic XML APIs and/or as a
complicated stylesheet.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 10 '06 #2

Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.

I am very new to this and have a follow-up question. I am just trying to
get going and am having trouble getting an example to work.

I was attempting to load categories.xml from my server using the
commented-out code below (at the bottom of the page). It looks right, but it
fails at xmlDoc = new ActiveXObject(...

So it seems that the ActiveXObject part is causing it to fail. I then
added the "testing" code below, from another web site, to explore what to
use for my xmlhttp variable, and I get the categories.xml file from the
server just fine.

But I need to be able to use the MSXML API as in the commented code.
How do I access "Msxml2.DOMDocument.6.0"? What am I doing wrong?

Thanks.

<script type="text/javascript">

function BuildDocument() {

    var xmlhttp = false;

    try {
        xmlhttp = new ActiveXObject("Msxml2.DOMDocument.6.0");
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject("Msxml2.DOMDocument.6.0");
        } catch (E) {
            xmlhttp = false;
        }
    }

    if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
        try {
            xmlhttp = new XMLHttpRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }
    if (!xmlhttp && window.createRequest) {
        try {
            xmlhttp = window.createRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }

    xmlhttp.open("GET", "categories.xml", true);
    xmlhttp.onreadystatechange = function() {

        if (xmlhttp.readyState == 4) {
            document.getElementById('results').innerHTML =
                xmlhttp.responseText;
        }

    }

    xmlhttp.send(null);

__________________________________

COMMENTED OUT CODE:
__________________________________

/************************************************

// Load XML

var xmlDoc = new ActiveXObject("Msxml2.DOMDocument.6.0");
xmlDoc.async = false;
xmlDoc.validateOnParse = false;
xmlDoc.load("categories.xml");
xml.async = false;
xml.load("categories.xml");

// Load XSL
var xsl = new ActiveXObject("Msxml2.DOMDocument.6.0");
xsl.async = false;
xsl.load("categories.xsl");

// Transform
document.write(xml.transformNode(xsl));

*************************************************/

}
Nov 10 '06 #3
pbd22 wrote:
Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.
For a single file this will probably work, but for anything bigger
(eg a folder-full) you really need an indexing engine, otherwise it
will take forever.

The problem with proximity search in marked text is to decide what
"proximate" means. If you allow proximity to bleed over markup
boundaries, you increase the number of hits but you risk them being
inaccurate or misleading. For example if you search for "character
function" with proximity set to more than 12 words, the text

<para>...stuff...and his character was by far the strongest
in the play.</para>
</section>
</chapter>
<chapter>
<head>Set Design</head>
<para>The function of set design in Restoration drama...</para>

will produce a hit which computer scientists may not expect. IMHE
the acceptable limit is to allow proximity to bleed across markup
in mixed content plus the first higher level of element content.
This would allow it to operate across (for example) adjacent
paragraphs, but not across adjacent sections or chapters.

This has implications for the indexing engine, as it needs to store
not only the character offsets of words but also their markup depth
and adjacency. Very few manage to do this correctly, despite the
original technique having been implemented a long time ago (PAT).
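
As a rough illustration of the rule above, here is a minimal sketch in which each index entry stores a word offset plus the id of the element that directly contains the word's paragraph. The entry layout, the ids, and the function name are assumptions made up for this example, not the format of PAT or of any real indexing engine, and "same container" is only a simplification of "adjacent paragraphs".

```javascript
// Hypothetical index entries, one per word occurrence.
// pos       = running word offset in the whole document
// container = made-up id of the element that directly contains the
//             word's paragraph (for example a section or a chapter)
const index = [
  { word: "character", pos: 105, container: "ch1/sec3" },
  { word: "function",  pos: 114, container: "ch2" },
  { word: "character", pos: 240, container: "ch2/sec1" },
  { word: "function",  pos: 246, container: "ch2/sec1" },
];

// Report pairs of occurrences that are within maxDistance words of each
// other AND share the same container, so a match may span sibling
// paragraphs but never a section or chapter boundary.
function proximityHits(index, wordA, wordB, maxDistance) {
  const hits = [];
  for (const a of index) {
    if (a.word !== wordA) continue;
    for (const b of index) {
      if (b.word !== wordB) continue;
      const closeEnough = Math.abs(a.pos - b.pos) <= maxDistance;
      const sameContainer = a.container === b.container;
      if (closeEnough && sameContainer) hits.push([a.pos, b.pos]);
    }
  }
  return hits;
}

console.log(proximityHits(index, "character", "function", 12));
```

With the sample entries, the pair at offsets 105 and 114 is close enough in word count but sits in different containers, so only the pair inside "ch2/sec1" is reported.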

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Nov 11 '06 #4
hi peter.

ok, thanks. well, i guess then i am in luck (kind of).
i am only doing a search on a single file (categories.xml).
the file, however, is very large and quite detailed - there are
sub categories of sub categories of sub categories and so on.

the good news is that the file does not take user input or,
for that matter, any text at all. it simply serves as a way for
users to search for a term, say, "Hard Drive", and find which
of the many available categories match that term. the response
from the server should be as many (remotely) related paths
as possible, each with its relevancy rank:

1) Technology Hardware Hard Drives 100%
2) Cinema Movies Features "Hard Drive" 93%
3) Books Politics Elections "Hard Drive" 90%
4) Books Sports Swimming Biography 36%
5) Media News International Art 30%
6) Music New Age 4%

so my case doesn't really match your example, where
paragraphs with very different contexts could produce very
misleading results.

What i do need to understand is how to rank such a search. How would
the logic work for scoring number (2) as 93% and (3) as 90%, say?
Should i be including a series of related words in the XML for each
topic, so that those with "more" related words get a higher rank? That
seems very crude. I'll do research on indexing engines but, based
on what you said, it seems like it may be overkill since i am working
with a single file (categories.xml and categories.xsl) and am not
dealing with wordy paragraphs.
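
One possible way to picture the "related words" idea is a weighted term-overlap score. This is only a sketch: the weights, the category objects, the keyword lists, and the function names are all invented for illustration and are not taken from categories.xml or from any library. A query term counts fully when it appears in the last element of a category path, less when it appears higher up the path, and least when it only appears in a list of related keywords.

```javascript
// Hypothetical weights: where a query term is found decides how much it counts.
const WEIGHTS = { leaf: 1.0, ancestor: 0.6, keyword: 0.3 };

// Lower-case, split on non-alphanumerics, and strip a trailing "s"
// from longer words as a very crude plural fold.
function terms(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
    .map((t) => (t.length > 3 && t.endsWith("s") ? t.slice(0, -1) : t));
}

// Score one category against the query as a percentage of the best possible score.
function scoreCategory(category, query) {
  const queryTerms = terms(query);
  if (queryTerms.length === 0) return 0;
  const leaf = terms(category.path[category.path.length - 1]);
  const ancestors = terms(category.path.slice(0, -1).join(" "));
  const keywords = terms((category.keywords || []).join(" "));
  let total = 0;
  for (const t of queryTerms) {
    if (leaf.includes(t)) total += WEIGHTS.leaf;
    else if (ancestors.includes(t)) total += WEIGHTS.ancestor;
    else if (keywords.includes(t)) total += WEIGHTS.keyword;
  }
  return Math.round((100 * total) / queryTerms.length);
}

// Score every category, drop the zero scores, and sort best first.
function rank(categories, query) {
  return categories
    .map((c) => ({ path: c.path.join(" > "), score: scoreCategory(c, query) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Made-up sample data standing in for categories.xml.
const categories = [
  { path: ["Technology", "Hardware", "Hard Drives"], keywords: ["disk", "storage"] },
  { path: ["Technology", "Hardware", "Monitors"], keywords: ["screen", "hard"] },
  { path: ["Music", "New Age"], keywords: [] },
];

console.log(rank(categories, "Hard Drive"));
```

With this sample data the "Hard Drives" path scores 100 and the "Monitors" path scores 15, because only one of the two query terms is found there and only in its keyword list. Whether percentages such as 93% versus 90% come out sensibly depends entirely on how the weights and keyword lists are chosen.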

thanks again.
Nov 12 '06 #5
pbd22 wrote:
What i do need to understand is how to rank such a search. How would
the logic
work for scoring number (2) as 93% and (3) as 90%, say?
That's an application design issue, not an XML issue per se.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 12 '06 #6

ok, fair enough.

i was just hoping that somebody could give me some ideas
about how to structure my categories.xml file for the kind of
search i am trying to do.

another poster provided some useful code for the XSL file (below).
what i still need is for somebody to show me how to pass the
user's search string from the client to the XSL file, and how to
structure the XML file for the kind of "proximity searching" that
i was discussing with Peter. Should each node contain a string
of key words?

If this is an application design issue and not an XML problem, fair
enough. otherwise, advice appreciated.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8"
indent="yes"/>
<!-- to pass the search string in from outside, replace the
xsl:variable named "data" below with
<xsl:param name="data"/> -->
<xsl:variable name="data" select="'met sport baseball'"/>
<xsl:variable name="upperCase"
select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="lowerCase"
select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="test"
select="translate($data,$upperCase,$lowerCase)"/>
<xsl:template match="/">
<xsl:apply-templates select="*/*">
<xsl:with-param name="search" select="$test"/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="*">
<xsl:param name="search"/>
<xsl:variable name="result">
<xsl:call-template name="searching">
<xsl:with-param name="Sdeb" select="$search"/>
<xsl:with-param name="Send" />
<xsl:with-param name="val" select="."/>
</xsl:call-template>
</xsl:variable>
<xsl:if test="string($result)=''">
found <xsl:value-of select="."/>
</xsl:if>
</xsl:template>
<xsl:template match="*[@title]">
<xsl:param name="search"/>
<xsl:variable name="result">
<xsl:call-template name="searching">
<xsl:with-param name="Sdeb" select="$search"/>
<xsl:with-param name="Send" />
<xsl:with-param name="val" select="@title"/>
</xsl:call-template>
</xsl:variable>
<xsl:choose>
<xsl:when test="string($result)=''">
found <xsl:value-of select="@title"/>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="*">
<xsl:with-param name="search" select="string($result)"/>
</xsl:apply-templates>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="searching">
<xsl:param name="Sdeb"/>
<xsl:param name="Send"/>
<xsl:param name="val"/>
<xsl:variable name="trans">
<xsl:choose>
<xsl:when test="contains($Sdeb,' ')">
<xsl:value-of select="substring-before($Sdeb,' ')"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$Sdeb"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:variable name="word" select="string($trans)"/>
<xsl:choose>
<xsl:when
test="contains(translate($val,$upperCase,$lowerCase),$word)">
<xsl:choose>
<xsl:when test="$Sdeb=$word">
<xsl:value-of select="$Send"/>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="searching">
<xsl:with-param name="Sdeb"
select="substring-after($Sdeb,' ')"/>
<xsl:with-param name="Send" select="$Send"/>
<xsl:with-param name="val" select="$val"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="$Sdeb=$word">
<xsl:value-of select="concat($Send,' ',$word)"/>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="searching">
<xsl:with-param name="Sdeb"
select="substring-after($Sdeb,' ')"/>
<xsl:with-param name="Send" select="concat($Send,' ',$word)"/>
<xsl:with-param name="val" select="$val"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>

Nov 12 '06 #7
pbd22 wrote:
but now, if somebody could show me how to pass the value from
the user's search string on the client to the XSL file
Look up "stylesheet parameters". The exact syntax for passing them in
varies from one XSLT processor to another, but the XSL syntax is the
same in all processors.
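
As one example of the processor-specific side, here is a sketch for browsers that implement the XSLTProcessor interface. It assumes the stylesheet declares a top-level <xsl:param name="data"/>; the function name and its arguments are made up for illustration, and loading the two documents is left out. MSXML uses a different route (as far as I know, an addParameter call on a processor object created from an XSLTemplate), so treat this as a sketch for one family of processors only.

```javascript
// Sketch for browsers that implement XSLTProcessor.
// xmlDoc and xslDoc are already-parsed DOM documents; targetDoc is the
// document that will own the result (normally the page's document).
function transformWithSearch(processor, xmlDoc, xslDoc, targetDoc, searchText) {
  processor.importStylesheet(xslDoc);
  // A null namespace plus the local name "data" addresses a top-level
  // <xsl:param name="data"/> in the stylesheet.
  processor.setParameter(null, "data", searchText);
  return processor.transformToFragment(xmlDoc, targetDoc);
}

// In a page this might be called as (hypothetical usage):
//   var fragment = transformWithSearch(new XSLTProcessor(), xmlDoc, xslDoc,
//                                      document, "met sport baseball");
//   document.getElementById("results").appendChild(fragment);
```

Passing the processor in as an argument keeps the function free of browser globals, which also makes it easy to try out with a stand-in object.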

Getting it from the client to a server is, presumably, standard client
forms and server programming.
and, how to
structure the XML file for the kind of "proximity searching" that
i was discussing with Peter.
As I say, I think that's drifting off from XML into basic programming
and data-structure design. Others may, of course, disagree.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 12 '06 #8
