Searching My XML File Using Keyword Searches?

Hi.

I am somewhat new to this and would like some advice.
I want to search my XML file using a "keyword" search and
return results based on "proximity matching". In other words,
since the search string will often not produce a direct match,
the results would be ranked by how closely they match (50%, 20%, 100%, etc.).

Are there any good examples out there of how to do keyword
searches on XML data? How should I set up my XML file so
as to make a tag as likely as possible to match a related
search term?

And finally, how is proximity determined?

I know that this is a heady question, but I am hoping that some
answers will at least put me on the right track, possibly with
links, sample code, or examples.

Thanks in advance.

Nov 10 '06 #1
pbd22 wrote:
I want to search my xml file using "keyword" search and
return results based on "proximity matching"
I don't know of any off-the-shelf code for this purpose, so you may be
stuck with implementing it yourself based on basic XML APIs and/or as a
complicated stylesheet.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 10 '06 #2

Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.

I am very new to this and have a follow-up question. I am just trying to
get going and am having trouble getting an example to work.

I was attempting to load categories.xml from my server using the
commented-out code (at the bottom of the script below). It looks right, but it
fails at xmlDoc = new ActiveXObject(...

So, it seems that the ActiveXObject part is causing it to fail. I then
added the "testing" code below from another web site to explore what to use
for my xmlhttp variable, and I get the categories.xml file from the server
just fine.

But I need to be able to use the MSXML API as in the commented code.
How do I access "Msxml2.DOMDocument.6.0"? What am I doing wrong?

Thanks.

<script type="text/javascript">

function BuildDocument() {

    var xmlhttp = false;

    // Try the MSXML ActiveX object first (Internet Explorer only).
    try {
        xmlhttp = new ActiveXObject("Msxml2.DOMDocument.6.0");
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject("Msxml2.DOMDocument.6.0");
        } catch (E) {
            xmlhttp = false;
        }
    }

    // Otherwise fall back to the native XMLHttpRequest object.
    if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
        try {
            xmlhttp = new XMLHttpRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }
    if (!xmlhttp && window.createRequest) {
        try {
            xmlhttp = window.createRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }

    // Fetch categories.xml asynchronously and show the raw response.
    xmlhttp.open("GET", "categories.xml", true);
    xmlhttp.onreadystatechange = function() {
        if (xmlhttp.readyState == 4) {
            document.getElementById('results').innerHTML = xmlhttp.responseText;
        }
    };

    xmlhttp.send(null);

    /************************************************
    COMMENTED OUT CODE:

    // Load XML
    var xmlDoc = new ActiveXObject("Msxml2.DOMDocument.6.0");
    xmlDoc.async = false;
    xmlDoc.validateOnParse = false;
    xmlDoc.load("categories.xml");

    // Load XSL
    var xsl = new ActiveXObject("Msxml2.DOMDocument.6.0");
    xsl.async = false;
    xsl.load("categories.xsl");

    // Transform
    document.write(xmlDoc.transformNode(xsl));

    *************************************************/

}
</script>
Nov 10 '06 #3
pbd22 wrote:
Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.
For a single file this will probably work, but for anything bigger
(e.g. a folder-full) you really need an indexing engine; otherwise it
will take forever.

The problem with proximity search in marked text is to decide what
"proximate" means. If you allow proximity to bleed over markup
boundaries, you increase the number of hits but you risk them being
inaccurate or misleading. For example if you search for "character
function" with proximity set to more than 12 words, the text

<para>...stuff...and his character was by far the strongest
in the play.</para>
</section>
</chapter>
<chapter>
<head>Set Design</head>
<para>The function of set design in Restoration drama...</para>

will produce a hit which computer scientists may not expect. IMHE
the acceptable limit is to allow proximity to bleed across markup
in mixed content plus the first higher level of element content.
This would allow it to operate across (for example) adjacent
paragraphs, but not across adjacent sections or chapters.

This has implications for the indexing engine, as it needs to store
not only the character offsets of words but also their markup depth
and adjacency. Very few manage to do this correctly, despite the
original technique having been implemented a long time ago (PAT).
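
A minimal sketch of the idea in the paragraphs above, assuming a hypothetical index in which each word is stored with its running word offset and the chain of elements that enclose it. The names `offset`, `path`, `sameContainer` and `withinProximity`, and the sample offsets, are illustrative and are not taken from PAT or any real engine. A pair of words counts as a hit only when the words are close enough and everything above the paragraph level is the same, so adjacent paragraphs qualify but adjacent sections or chapters do not:

```javascript
// Each entry: the word, its running word offset in the document, and the
// chain of enclosing elements (outermost first, paragraph last).
const index = [
  { word: "function",  offset: 95,  path: ["chapter[1]", "section[3]", "para[4]"] },
  { word: "character", offset: 100, path: ["chapter[1]", "section[3]", "para[5]"] },
  { word: "function",  offset: 112, path: ["chapter[2]", "para[1]"] },
];

// Two entries share a container when everything above the paragraph matches.
function sameContainer(a, b) {
  const pa = a.path.slice(0, -1);
  const pb = b.path.slice(0, -1);
  return pa.length === pb.length && pa.every((step, i) => step === pb[i]);
}

// Close enough in words AND not separated by a section or chapter boundary.
function withinProximity(a, b, maxWords) {
  return Math.abs(a.offset - b.offset) <= maxWords && sameContainer(a, b);
}

// All pairs (first, second) that satisfy the proximity rule.
function proximityHits(entries, first, second, maxWords) {
  const hits = [];
  for (const a of entries.filter((e) => e.word === first)) {
    for (const b of entries.filter((e) => e.word === second)) {
      if (withinProximity(a, b, maxWords)) hits.push([a, b]);
    }
  }
  return hits;
}

const hits = proximityHits(index, "character", "function", 12);
console.log(hits.map(([a, b]) => a.offset + " ~ " + b.offset));
```

With the sample data, the "function" at offset 112 is within 12 words of "character" but sits in a different chapter, so only the pairing with the neighbouring paragraph is kept.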

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Nov 11 '06 #4
hi peter.

OK, thanks. Well, I guess then I am in luck (kind of).
I am only doing a search on a single file (categories.xml).
The file, however, is very large and quite detailed: there are
subcategories of subcategories of subcategories, and so on.

The good news is that the file does not take user input or,
for that matter, any free text at all. It simply serves as a way for
users to search for a term, say "Hard Drive", and find which
of the many available categories match that term. The response
from the server should be as many (even remotely) related paths
as possible, each with its relevancy rank:

1) Technology Hardware Hard Drives            100%
2) Cinema Movies Features "Hard Drive"         93%
3) Books Politics Elections "Hard Drive"       90%
4) Books Sports Swimming Biography             36%
5) Media News International Art                30%
6) Music New Age                                4%

So my example doesn't really match yours, where paragraphs with massive
contextual differences could produce very misleading results.

What I do need to understand is how to rank such a search. How would
the logic work for scoring number (2) as 93% and (3) as 90%, say? Should I be
including a series of related words in the XML for each topic, so that those
with "more" related words get a higher rank? That seems very crude. I'll do
some research on indexing engines but, based on what you said, it seems like
that may be overkill, since I am working with a single file (categories.xml
and categories.xsl) and am not dealing with wordy paragraphs.

thanks again.
Nov 12 '06 #5
pbd22 wrote:
What i do need to understand is how to rank such a search. How would
the logic
work for scoring number (2) as 93% and (3) as 90%, say?
That's an application design issue, not an XML issue per se.
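
Since the ranking is application logic rather than XML, here is one hedged sketch of how a percentage could be produced. It is not a standard algorithm and not anyone's method from this thread: it assumes each category path has been flattened into a string and given an optional hand-written `keywords` list (both hypothetical), and it scores a path by the share of query words that begin some word in the path or its keywords. The prefix test that lets "drive" match "drives" is a deliberately crude assumption.

```javascript
// Hypothetical flattened categories; "keywords" are hand-picked related terms.
const categories = [
  { path: "Technology Hardware Hard Drives", keywords: ["disk", "storage"] },
  { path: "Books Computing Storage", keywords: ["backup"] },
  { path: "Music New Age", keywords: [] },
];

// Lower-case the text and split it into alphanumeric words.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Percentage of query words that are a prefix of some path word or keyword.
function score(query, category) {
  const queryWords = tokenize(query);
  if (queryWords.length === 0) return 0;
  const terms = tokenize(category.path)
    .concat(category.keywords.map((k) => k.toLowerCase()));
  const matched = queryWords.filter((q) => terms.some((t) => t.startsWith(q)));
  return Math.round((100 * matched.length) / queryWords.length);
}

// Score every category, drop the zero scores, best match first.
function rank(query, list) {
  return list
    .map((c) => ({ path: c.path, percent: score(query, c) }))
    .filter((r) => r.percent > 0)
    .sort((a, b) => b.percent - a.percent);
}

const results = rank("Hard Drive storage", categories);
results.forEach((r) => console.log(r.path + "  " + r.percent + "%"));
```

Under this scheme a category with more related words matches more queries, which is the "crude" approach described earlier in the thread; weighting exact matches above keyword matches would be one obvious refinement.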

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 12 '06 #6

OK, fair enough.

I was just hoping that somebody could give me some ideas
about how to structure my categories.xml file for the kind of
search I am trying to do.

Another poster provided some useful code for the XSL file (below).
Now, could somebody show me how to pass the user's search string
from the client to the XSL file, and how to structure the XML file
for the kind of "proximity searching" that I was discussing with Peter?
Should each node contain a string of keywords?

If this is an application design issue and not an XML problem, fair
enough. Otherwise, advice is appreciated.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <!-- To take the search string from the caller, change the xsl:variable
       named "data" below into <xsl:param name="data"/> -->
  <xsl:variable name="data" select="'met sport baseball'"/>
  <xsl:variable name="upperCase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lowerCase" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="test" select="translate($data,$upperCase,$lowerCase)"/>
  <xsl:template match="/">
    <xsl:apply-templates select="*/*">
      <xsl:with-param name="search" select="$test"/>
    </xsl:apply-templates>
  </xsl:template>
  <!-- Elements without a title: test the search words against the element's text. -->
  <xsl:template match="*">
    <xsl:param name="search"/>
    <xsl:variable name="result">
      <xsl:call-template name="searching">
        <xsl:with-param name="Sdeb" select="$search"/>
        <xsl:with-param name="Send"/>
        <xsl:with-param name="val" select="."/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="string($result)=''">
      found <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>
  <!-- Elements with a title: report a match, or pass the unmatched words to the children. -->
  <xsl:template match="*[@title]">
    <xsl:param name="search"/>
    <xsl:variable name="result">
      <xsl:call-template name="searching">
        <xsl:with-param name="Sdeb" select="$search"/>
        <xsl:with-param name="Send"/>
        <xsl:with-param name="val" select="@title"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="string($result)=''">
        found <xsl:value-of select="@title"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="*">
          <xsl:with-param name="search" select="string($result)"/>
        </xsl:apply-templates>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <!-- Returns the words of Sdeb that do not occur in val; Send accumulates them. -->
  <xsl:template name="searching">
    <xsl:param name="Sdeb"/>
    <xsl:param name="Send"/>
    <xsl:param name="val"/>
    <xsl:variable name="trans">
      <xsl:choose>
        <xsl:when test="contains($Sdeb,' ')">
          <xsl:value-of select="substring-before($Sdeb,' ')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$Sdeb"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:variable name="word" select="string($trans)"/>
    <xsl:choose>
      <xsl:when test="contains(translate($val,$upperCase,$lowerCase),$word)">
        <xsl:choose>
          <xsl:when test="$Sdeb=$word">
            <xsl:value-of select="$Send"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="searching">
              <xsl:with-param name="Sdeb" select="substring-after($Sdeb,' ')"/>
              <xsl:with-param name="Send" select="$Send"/>
              <xsl:with-param name="val" select="$val"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="$Sdeb=$word">
            <xsl:value-of select="concat($Send,' ',$word)"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="searching">
              <xsl:with-param name="Sdeb" select="substring-after($Sdeb,' ')"/>
              <xsl:with-param name="Send" select="concat($Send,' ',$word)"/>
              <xsl:with-param name="val" select="$val"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
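
For anyone who finds the recursion hard to follow, the sketch below restates in JavaScript what the stylesheet appears to do: each element's title removes the search words it contains, the leftover words are passed down to the children, and an element is reported once no words are left. The nested `{ title, children }` objects and the function names are assumptions standing in for categories.xml, whose real structure is not shown in this thread. As a simplification, the sketch starts at the top node, whereas the stylesheet starts at the children of the document element.

```javascript
// Hypothetical stand-in for a categories.xml tree with title attributes.
const tree = {
  title: "Sports",
  children: [
    { title: "Baseball", children: [{ title: "Mets" }, { title: "Yankees" }] },
    { title: "Swimming", children: [] },
  ],
};

// Words from the search that do not occur (as substrings) in the text.
function unmatchedWords(words, text) {
  const haystack = text.toLowerCase();
  return words.filter((word) => !haystack.includes(word));
}

// Report a node when every search word has been matched on the way down;
// otherwise keep looking in the children with the words still unmatched.
function search(node, words, found = []) {
  const remaining = unmatchedWords(words, node.title);
  if (remaining.length === 0) {
    found.push(node.title);
  } else {
    for (const child of node.children || []) {
      search(child, remaining, found);
    }
  }
  return found;
}

const words = "met sport baseball".toLowerCase().split(/\s+/).filter(Boolean);
const found = search(tree, words);
console.log(found);
```

Note that this is an all-or-nothing match: a path is reported only when every search word is matched somewhere along it, so it gives no percentage ranking by itself.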

Nov 12 '06 #7
pbd22 wrote:
but now, if somebody could show me how to pass the value from
the user's search string on the client to the XSL file
Look up "stylesheet parameters". The exact syntax for passing them in
varies from one XSLT processor to another, but the XSL syntax is the
same in all processors.
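
As a concrete but hedged illustration: on the XSL side the stylesheet declares a top-level `<xsl:param name="data"/>`, and the caller supplies a value before running the transform. The sketch below assumes a browser that provides the `XSLTProcessor` interface; MSXML uses a different call (`addParameter` on its XSL processor object), which is not shown here. The function name and the parameter name `data` are illustrative, and the processor is passed in as an argument so the function can be tried with a stand-in object.

```javascript
// processor: an XSLTProcessor (or any object with the same three methods)
// xslDoc / xmlDoc: the parsed stylesheet and source documents
// ownerDoc: the document that will own the resulting fragment
function transformWithSearch(processor, xslDoc, xmlDoc, ownerDoc, searchText) {
  processor.importStylesheet(xslDoc);
  // Null namespace; the name must match <xsl:param name="data"/>.
  processor.setParameter(null, "data", searchText);
  return processor.transformToFragment(xmlDoc, ownerDoc);
}

// In a browser the call might look like this (searchBox is hypothetical):
//   const fragment = transformWithSearch(new XSLTProcessor(), xslDoc, xmlDoc,
//                                        document, searchBox.value);
//   document.getElementById("results").appendChild(fragment);
```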

Getting it from the client to a server is, presumably, standard client
forms and server programming.

pbd22 wrote:
and, how to
structure the XML file for the kind of "proximity searching" that
i was discussing with Peter.
As I say, I think that's drifting off from XML into basic programming
and data-structure design. Others may, of course, disagree.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Nov 12 '06 #8
