Searching XML

Hi all,

I just finished writing a log reader that reads XML logs (about 1 to 2 MB
in size). At the command line you can specify the file name, the name of the
element and its value, like so: logreader log.txt MyElement myvalue

In retrospect, I've noticed that it takes a long time to process. The time
is spent on comparing the value of all tags named MyElement to myvalue.
Namely:

NodeList nodeList = m_document.getElementsByTagName(MyElement);
for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++)
    // getTextNode merely returns the text value of the node
    if (getTextNode(nodeList.item(index)).trim().equals(myvalue))
    {
        counter++;
        tempIndex[arrIndex++] = index;
    }

This takes around 20 seconds to complete processing. So my question is: is
there some way I can extract XML elements based on the element value?
For example, XPath allows you to choose elements based on attribute value, so
I'm wondering, is there a similar mechanism that allows you to grab
elements based on their value?
Thanks.
Jul 20 '05 #1
On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <na**@oplink.net> wrote:
<snip>
I'm wondering, is there a similar mechanism that allows you to grab
elements based on their value?

Here is a query that selects data based on element values...

This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

for $b in doc("books.xml")//book
where some $a in $b/author
      satisfies ($a/last = "Stevens" and $a/first = "W.")
return $b/title

returns these results:

<title>TCP/IP Illustrated</title>,
<title>Advanced Programming in the UNIX Environment</title>

Using this data:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor><last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>
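
For anyone without an XQuery processor, the same selection can be written as a plain XPath expression. Below is a minimal Java sketch, assuming the standard javax.xml.xpath API is available; the class name BooksByAuthor, the helper titlesBy and the cut-down BIB string are made up for illustration and are not from the tutorial.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BooksByAuthor {
    // Cut-down copy of the sample data (two of the four books), for illustration only.
    static final String BIB =
        "<bib>"
      + "<book year=\"1994\"><title>TCP/IP Illustrated</title>"
      + "<author><last>Stevens</last><first>W.</first></author></book>"
      + "<book year=\"2000\"><title>Data on the Web</title>"
      + "<author><last>Abiteboul</last><first>Serge</first></author>"
      + "<author><last>Suciu</last><first>Dan</first></author></book>"
      + "</bib>";

    // Returns the titles of all books that have an author with the given names.
    // Assumption: the names contain no quote characters, because they are
    // spliced into the expression as string literals.
    static List<String> titlesBy(String xml, String last, String first)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//book[author[last='" + last + "' and first='" + first + "']]/title";
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesBy(BIB, "Stevens", "W."));
    }
}
```

The nested predicate plays the role of the "some ... satisfies" clause: a book is kept if at least one of its author children has both the matching last and first name.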

HTH

Jul 20 '05 #2
On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <na**@oplink.net>
wrote:
This takes around 20 seconds to complete processing.

I'm not surprised! getElementsByTagName is always slow, but it's
also inefficient here because it has to look everywhere in the
structure to find elements and test their names. If you can narrow
the search by looking for elements as children or grandchildren,
rather than searching everywhere for them, that can be a good
tweak.

XML is often incredibly powerful, but this excess power can lead to
inefficiencies if it's being used "by default" when you didn't really
need it.

So my question is, is
there some way where I can extract xml elements based on the element value.

Yes, XPath! Just use "//MyElementName"

Or if MyElementName is supplied by the user, then use a [...]
predicate and the local-name() function to get the name of each
element, and compare it to an element name supplied as a
parameter.

<xsl:param name="elmName">MyElementName</xsl:param>
...
//*[local-name() = string($elmName)]

XQuery (and various other incarnations) will do it too, and with
better performance. However, it's sometimes hard to find XQuery
support in an environment, whereas most will have XSLT and XPath
available.
Jul 20 '05 #3
On Tue, 26 Oct 2004 12:09:25 +0100, Andy Dingley <di*****@codesmiths.com>
wrote:
<snip>

I like Andy's answer better.
Jeff Kish
Jul 20 '05 #4
Hi Andy,

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue). So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>. Sorry for
not being clear. It seems your XPath examples get elements based on their
name, but not their value.
Nash

Jul 20 '05 #5
On Tue, 26 Oct 2004 10:09:27 -0500, Nash Kabbara <na**@oplink.net>
wrote:
Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue).

I don't recognise the coding platform. What is it?

There's a lot you can do to improve that loop.
- Use an iterator, not an array index.
- Be suspicious of that .getLength() method, especially in a loop
bound. Is that a per-iteration overhead you've given yourself?
- Never trim() when you can rtrim().
- Never trim() when you can use a space-ignoring comparison instead.
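
If the platform turns out to be Java with the W3C DOM, the first two points might look roughly like the sketch below. This is only an illustration under that assumption: the class DomValueLoop, the helpers matches and parse, and the SAMPLE string are invented here, and getTextContent() stands in for the poster's own getTextNode helper. Whether hoisting the length actually saves time depends on the parser's NodeList implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomValueLoop {
    // Hypothetical miniature log, for illustration only.
    static final String SAMPLE =
        "<log>"
      + "<entry><MyElement> myvalue </MyElement></entry>"
      + "<entry><MyElement>other</MyElement></entry>"
      + "<entry><MyElement>myvalue</MyElement></entry>"
      + "</log>";

    // Collects every element called `name` whose trimmed text equals `value`.
    static List<Element> matches(Document doc, String name, String value) {
        NodeList nodeList = doc.getElementsByTagName(name);
        int length = nodeList.getLength();      // read once, not on every iteration
        List<Element> hits = new ArrayList<>(); // grows as needed, no index array
        for (int i = 0; i < length; i++) {
            Element e = (Element) nodeList.item(i);
            if (e.getTextContent().trim().equals(value)) {
                hits.add(e);
            }
        }
        return hits;
    }

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(matches(parse(SAMPLE), "MyElement", "myvalue").size());
    }
}
```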

The trouble with much XML optimisation is that it becomes sensitive to
the data you feed it. Do you have a lot of matching elements to walk
through, or is finding the set of elements the main problem ?

So I was wondering if there's a built in
mechanism that pulls elements based on their Value. When I say "Value" I
mean their content, not their name. i.e <Element>value</Element>.

Yes, XPath!

Use a similar predicate, "//*[string(.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.
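
From Java, a predicate like this can be evaluated through the standard javax.xml.xpath API. The following is a rough sketch, not a tested recommendation for the log reader: ValueSearch, findByValue and SAMPLE are hypothetical names, the variables are bound through an XPathVariableResolver, and normalize-space(.) is swapped in for string(.) to approximate the trim() in the original loop (it also collapses runs of internal whitespace, which trim() does not).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ValueSearch {
    // Hypothetical miniature log, for illustration only.
    static final String SAMPLE =
        "<log>"
      + "<entry><MyElement> myvalue </MyElement></entry>"
      + "<entry><MyElement>other</MyElement></entry>"
      + "<entry><MyElement>myvalue</MyElement></entry>"
      + "</log>";

    // Selects elements by name and by (whitespace-normalised) text content.
    static NodeList findByValue(String xml, String name, String value)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $elmName and $elmContents so user input never has to be
        // pasted into the expression text.
        xpath.setXPathVariableResolver(q -> {
            switch (q.getLocalPart()) {
                case "elmName":     return name;
                case "elmContents": return value;
                default:            return null; // unknown variable
            }
        });
        String expr = "//*[local-name() = $elmName][normalize-space(.) = $elmContents]";
        return (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findByValue(SAMPLE, "MyElement", "myvalue").getLength());
    }
}
```

The expression still visits every element in the document, so on its own it is a tidier way to state the search rather than a guaranteed speed-up.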

--
Smert' spamionam
Jul 20 '05 #6
I think you're coding in Java.

It is better to use SAX, the Simple API for XML.
You then don't have to load the entire DOM,
and you can do some optimizations.

SAX is a good choice if what you want to do is not too complex.
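
As a rough sketch of what that could look like (the class name SaxValueCounter and the SAMPLE string are made up for illustration, and it assumes the target element is never nested inside itself), a handler can buffer the text of each matching element and compare it when the element closes:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxValueCounter extends DefaultHandler {
    // Hypothetical miniature log, for illustration only.
    static final String SAMPLE =
        "<log>"
      + "<entry><MyElement> myvalue </MyElement></entry>"
      + "<entry><MyElement>other</MyElement></entry>"
      + "<entry><MyElement>myvalue</MyElement></entry>"
      + "</log>";

    private final String name;
    private final String value;
    private StringBuilder text; // non-null only while inside a target element
    int count;

    SaxValueCounter(String name, String value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(name)) {
            text = new StringBuilder(); // start buffering this element's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks, so append.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && qName.equals(name)) {
            if (text.toString().trim().equals(value)) {
                count++;
            }
            text = null; // stop buffering until the next target element
        }
    }

    // Streams the XML once and returns the number of matching elements.
    static int count(String xml, String name, String value) throws Exception {
        SaxValueCounter handler = new SaxValueCounter(name, value);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(SAMPLE, "MyElement", "myvalue"));
    }
}
```

Only the text of the element currently being read is held in memory, so the cost should grow with the size of the file rather than with the size of a DOM tree.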

Greetz
Tjerk

Jul 20 '05 #7
<snip>
Yes, XPath !

Use a similar predicate, "//*[string (.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.

<snip>
Lots of good info in this thread!
Yes, SAX if you don't need to load your entire document into memory.

Oh, regarding XQuery:

for $b in doc("books.xml")//*[. = "TCP/IP Illustrated"]
return
    <temp>{string($b), name($b)}</temp>

(: results in this output
<temp>TCP/IP Illustrated title</temp>
:)

Jeff Kish
Jul 20 '05 #8
