#1
July 20th, 2005, 08:48 AM
Nash Kabbara
Searching XML

Hi all,

I just finished writing a log reader that reads XML logs (about 1 to 2 MB
in size). At the command line you specify the file name, the name of the
element, and its value, like so: logreader log.txt MyElement myvalue

In retrospect, I've noticed that it takes a long time to process. The time
is spent comparing the value of every tag named MyElement to myvalue.
Namely:

NodeList nodeList = m_document.getElementsByTagName(MyElement);
for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++)
    // getTextNode merely returns the text value of the node
    if (getTextNode(nodeList.item(index)).trim().equals(myvalue))
    {
        counter++;
        tempIndex[arrIndex++] = index;
    }

This takes around 20 seconds to complete. So my question is: is there some
way to extract XML elements based on the element value? For example, XPath
lets you choose elements based on attribute value, so I'm wondering whether
there is a similar mechanism that lets you grab elements based on their
value.


Thanks.
#2
July 20th, 2005, 08:48 AM
Jeff Kish
Re: Searching XML

On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <nash@oplink.net> wrote:
>Hi all,
>
>I just finished writing a log reader that reads XML logs (about 1 to 2 MB
>in size). At the command line you specify the file name, the name of the
>element, and its value, like so: logreader log.txt MyElement myvalue
>
>In retrospect, I've noticed that it takes a long time to process. The time
>is spent comparing the value of every tag named MyElement to myvalue.
>Namely:
>
>NodeList nodeList = m_document.getElementsByTagName(MyElement);
>for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++)
>    // getTextNode merely returns the text value of the node
>    if (getTextNode(nodeList.item(index)).trim().equals(myvalue))
>    {
>        counter++;
>        tempIndex[arrIndex++] = index;
>    }
>
>This takes around 20 seconds to complete. So my question is: is there some
>way to extract XML elements based on the element value? For example, XPath
>lets you choose elements based on attribute value, so I'm wondering whether
>there is a similar mechanism that lets you grab elements based on their
>value.
>
>Thanks.

Here is a query that selects data based on element values...

This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

for $b in doc("books.xml")//book
where some $a in $b/author
      satisfies ($a/last="Stevens" and $a/first="W.")
return $b/title

returns these results:

<title>TCP/IP Illustrated</title>,
<title>Advanced Programming in the UNIX Environment</title>


Using this data:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor><last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>
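
For readers without an XQuery processor at hand, the same selection can also be written as a plain XPath 1.0 expression and evaluated from Java with the standard javax.xml.xpath API. This is only a sketch and not part of the original post: the class name BookTitles and the trimmed two-book sample in main are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
final class BookTitles {
    // XPath 1.0 counterpart of the "some ... satisfies" test:
    // a book qualifies if any one author has both child values.
    static final String QUERY =
        "//book[author[last='Stevens' and first='W.']]/title";

    static List<String> titles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                QUERY, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0, n = hits.getLength(); i < n; i++) {
                result.add(hits.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Trimmed sample, assumed for the demo; the full data is shown above.
        String xml = "<bib>"
            + "<book><title>TCP/IP Illustrated</title>"
            + "<author><last>Stevens</last><first>W.</first></author></book>"
            + "<book><title>Data on the Web</title>"
            + "<author><last>Suciu</last><first>Dan</first></author></book>"
            + "</bib>";
        System.out.println(titles(xml));
    }
}
```

The nested predicate matters: writing //book[author/last='Stevens' and author/first='W.'] would also accept a book where one author has the last name and a different author has the first name.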

HTH

#3
July 20th, 2005, 08:48 AM
Andy Dingley
Re: Searching XML

On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <nash@oplink.net>
wrote:
>This takes around 20 seconds to complete processing.

I'm not surprised ! getElementsByTagName is always slow, but it's
also inefficient here because it's having to look everywhere in the
structure to find elements to test their names. If you can improve
the search by looking for elements as children or grand-children,
rather than searching everywhere for them, then this can be a good
tweak.

XML is often incredibly powerful, but this excess power can lead to
inefficiencies if it's being used "by default" when you didn't really
need it.

> So my question is, is
>there some way where I can extract xml elements based on the element value.

Yes, XPath ! Just use "//MyElementName"

Or if MyElementName is supplied by the users, then use a [...]
predicate and the local-name() function to get the name of the
element, then compare it to the value of an element name supplied as a
parameter.

<xsl:param name="elmName" >MyElementName</xsl:param>
...
//*[local-name() = string($elmName)]


XQuery (and various other incarnations) will do it too, and with
better performance. However it's sometimes hard to find XQuery
features in an environment, but most will have XSLT and XPath
available.
#4
July 20th, 2005, 08:48 AM
Jeff Kish
Re: Searching XML

On Tue, 26 Oct 2004 12:09:25 +0100, Andy Dingley <dingbat@codesmiths.com>
wrote:
>On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <nash@oplink.net>
>wrote:
>
>>This takes around 20 seconds to complete processing.
>
>I'm not surprised ! getElementsByTagName is always slow, but it's
>also inefficient here because it's having to look everywhere in the
>structure to find elements to test their names. If you can improve
>the search by looking for elements as children or grand-children,
>rather than searching everywhere for them, then this can be a good
>tweak.
>
>XML is often incredibly powerful, but this excess power can lead to
>inefficiencies if it's being used "by default" when you didn't really
>need it.
>
>> So my question is, is
>>there some way where I can extract xml elements based on the element value.
>
>Yes, XPath ! Just use "//MyElementName"
>
>Or if MyElementName is supplied by the users, then use a [...]
>predicate and the local-name() function to get the name of the
>element, then compare it to the value of an element name supplied as a
>parameter.
>
><xsl:param name="elmName" >MyElementName</xsl:param>
> ...
>//*[local-name() = string($elmName)]
>
>
>XQuery (and various other incarnations) will do it too, and with
>better performance. However it's sometimes hard to find XQuery
>features in an environment, but most will have XSLT and XPath
>available.

I like Andy's answer better.
Jeff Kish
#5
July 20th, 2005, 08:48 AM
Nash Kabbara
Re: Searching XML

Hi Andy,

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue). So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>. Sorry for
not being clear. It seems your XPath examples get elements based on their
name, but not their value.


Nash

#6
July 20th, 2005, 08:48 AM
Andy Dingley
Re: Searching XML

On Tue, 26 Oct 2004 10:09:27 -0500, Nash Kabbara <nash@oplink.net>
wrote:
> Thanks for the response. Actually the lag is not in getElementsByTagName,
>but by the loop I have that compares the values of the tags with what the
>user is looking for (myvalue).

I don't recognise the coding platform - what is it ?

There's a lot you can do to improve that loop.
- Use an iterator, not an array index.
- Be suspicious of that .getLength() method, especially in a loop
  bound. Is that a per-iteration overhead you've given yourself ?
- Never trim() when you can rtrim().
- Never trim() when you can use a space-ignoring comparison instead.

The trouble with much XML optimisation is that it becomes sensitive to
the data you feed it. Do you have a lot of matching elements to walk
through, or is finding the set of elements the main problem ?
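
The first two suggestions could look like the sketch below, assuming the platform is Java's org.w3c.dom API (as a later reply guesses). It follows sibling links instead of indexing a NodeList, so there is no getLength() call in a loop bound at all. The class name SiblingWalk is hypothetical, and the standard getTextContent() stands in for the poster's getTextNode() helper, whose exact behaviour is not shown in the thread. Whether this is actually faster depends on the DOM implementation, so measure before and after.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
final class SiblingWalk {
    // Collects, in document order, every element called tagName whose
    // trimmed text content equals wanted.
    static List<Element> matches(Node start, String tagName, String wanted) {
        List<Element> found = new ArrayList<>();
        collect(start, tagName, wanted, found);
        return found;
    }

    private static void collect(Node node, String tagName, String wanted,
                                List<Element> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equals(tagName)
                && node.getTextContent().trim().equals(wanted)) {
            found.add((Element) node);
        }
        // Follow sibling links rather than asking a NodeList for item(i).
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, tagName, wanted, found);
        }
    }

    // Convenience parser for the demo; wraps checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<log><entry><MyElement> myvalue </MyElement></entry>"
            + "<entry><MyElement>other</MyElement></entry></log>");
        System.out.println(matches(doc, "MyElement", "myvalue").size());
    }
}
```

Collecting the matching elements in a list also removes the need for the fixed-size tempIndex array and the separate counter in the original loop.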

> So I was wondering if there's a built in
>mechanism that pulls elements based on their Value. When I say "Value" I
>mean their content, not their name. i.e <Element>value</Element>.

Yes, XPath !

Use a similar predicate, "//*[string(.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.
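
From Java, a predicate like this can be evaluated with the standard javax.xml.xpath API. The sketch below is an illustration under assumptions, not the poster's code: the class name ValueSearch is hypothetical, it combines the name test from the earlier reply with the value test, and it uses normalize-space(.) rather than string(.) to approximate the trim() in the original loop (unlike trim(), normalize-space also collapses runs of whitespace inside the text).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
final class ValueSearch {
    static int countMatches(String xml, String elmName, String elmContents) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("elmName", elmName);
        vars.put("elmContents", elmContents);

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Supply $elmName and $elmContents as XPath variables, so the
        // user's input is never pasted into the expression text.
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
        try {
            Double count = (Double) xpath.evaluate(
                "count(//*[local-name() = $elmName]"
                    + "[normalize-space(.) = $elmContents])",
                new InputSource(new StringReader(xml)),
                XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement> myvalue </MyElement>"
            + "<MyElement>other</MyElement><Other>myvalue</Other></log>";
        System.out.println(countMatches(xml, "MyElement", "myvalue"));
    }
}
```

To get the elements themselves rather than a count, drop the count(...) wrapper and request XPathConstants.NODESET instead of NUMBER.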

--
Smert' spamionam
#7
July 20th, 2005, 08:48 AM
Tjerk Wolterink
Re: Searching XML

I think you're coding in Java.

It is better to use SAX, the Simple API for XML.
You then don't have to load the entire DOM,
and you can do some optimizations.

SAX is a good choice as long as what you want to do is not too complex.
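
As a rough sketch of the idea, using the SAX classes that ship with Java: the handler below counts elements with a given name whose trimmed text equals a given value, without building a tree. The class name SaxValueCounter is hypothetical, and the code assumes that target elements are not nested inside one another and that the parser is not namespace-aware (so qName carries the plain element name).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler class, named for this example only.
final class SaxValueCounter extends DefaultHandler {
    private final String tagName;
    private final String wanted;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;   // true while between <tagName> and </tagName>
    private int count;

    SaxValueCounter(String tagName, String wanted) {
        this.tagName = tagName;
        this.wanted = wanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (qName.equals(tagName)) {
            inside = true;
            text.setLength(0);   // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks.
        if (inside) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(tagName)) {
            if (text.toString().trim().equals(wanted)) {
                count++;
            }
            inside = false;
        }
    }

    // Parses the XML once and returns the number of matching elements.
    static int count(String xml, String tagName, String wanted) {
        SaxValueCounter handler = new SaxValueCounter(tagName, wanted);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement> myvalue </MyElement>"
            + "<MyElement>other</MyElement></log>";
        System.out.println(count(xml, "MyElement", "myvalue"));
    }
}
```

For a real log file, pass a java.io.File or an InputStream to SAXParser.parse instead of a string.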

Greetz
Tjerk

#8
July 20th, 2005, 08:48 AM
Jeff Kish
Re: Searching XML

<snip>
>Yes, XPath !
>
>Use a similar predicate, "//*[string (.) = $elmContents]"
>
>string() is optional (because in this context it's the default
>behaviour) but it's good practice to use it in situations like this,
>because it makes reading your code a lot clearer in the future.
<snip>
Lots of good info in this thread!
Yes, SAX if you don't need to load your entire document into memory.

Oh, regarding XQuery:

for $b in doc("books.xml")//*[.="TCP/IP Illustrated"]
return
  <temp>{string($b/.), name($b/.)}</temp>

(: results in this output
<temp>TCP/IP Illustrated title</temp>
:)

Jeff Kish