Search for string, then extract entire XML element where it appears. How?

I need to extract some elements from a very large XML file. Because of
the size, I'd like to work with it on my Linux machine as a text file.

Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

The XML document is comprised of a bunch of <item> elements:

<?xml version="1.0" encoding="UTF-8"?>
<item>
  <property1>100</property1>
  <property2>
    <id>0</id>
    <code>ThisIsTheStringINeedToMatch</code>
  </property2>
  <keyword>
    <value>value1</value>
    <value>value2</value>
  </keyword>
  <color>
    <type>21</type>
    <shade>1</shade>
  </color>
</item>

How would you approach this? I can write a script to find each code,
but I'm not sure how to then search forwards/backwards to extract the
enclosing <item> element.

Thanks!

M

Jun 30 '06 #1
6 Replies


ma******@gmail.com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.
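
As a rough illustration of the DOM route, here is a minimal Python sketch using the standard-library xml.dom.minidom. It assumes the <item> elements sit under a single root element (called <items> here, which is not shown in the original post), and the helper names enclosing_item and items_with_code are made up for this example. It loads the whole document into memory, so it only suits files that fit in RAM.

```python
from xml.dom import minidom

# Hypothetical sample; the <items> wrapper is an assumption.
SAMPLE = """<items>
  <item><property2><code>AAA</code></property2></item>
  <item><property2><code>BBB</code></property2></item>
</items>"""


def enclosing_item(node):
    """Walk up the parentNode links until an <item> element is reached."""
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "item":
            return node
        node = node.parentNode
    return None  # the node was not inside any <item>


def items_with_code(xml_text, wanted):
    """Return the serialized <item> elements whose <code> text is in `wanted`."""
    doc = minidom.parseString(xml_text)
    found = []
    for code in doc.getElementsByTagName("code"):
        # Join text and CDATA children to get the element's character data.
        text = "".join(
            child.data
            for child in code.childNodes
            if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
        )
        if text.strip() in wanted:
            item = enclosing_item(code)
            if item is not None:
                found.append(item.toxml())
    return found
```

Calling items_with_code(SAMPLE, {"BBB"}) should return a one-element list holding the second <item>.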

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.
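
The buffering idea might look roughly like the following Python sketch built on the standard-library xml.sax module. The class name ItemFilter and the function filter_items are invented for this example, and it assumes <item> elements are not nested inside each other. Each item is re-serialized into a buffer as its events arrive, and the buffer is kept only if a matching <code> was seen before the item closed.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class ItemFilter(xml.sax.ContentHandler):
    """Buffer each <item> as text; keep it only if its <code> is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.buffer = None     # serialized pieces while inside an <item>
        self.code_text = None  # character data of the current <code>
        self.keep = False
        self.matches = []

    def startElement(self, name, attrs):
        if name == "item":
            self.buffer, self.keep = [], False
        if self.buffer is None:
            return  # outside any <item>: nothing to buffer
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.buffer.append(f"<{name}{attr_text}>")
        if name == "code":
            self.code_text = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(escape(content))
        if self.code_text is not None:
            self.code_text.append(content)

    def endElement(self, name):
        if self.buffer is None:
            return
        self.buffer.append(f"</{name}>")
        if name == "code":
            if "".join(self.code_text).strip() in self.wanted:
                self.keep = True
            self.code_text = None
        elif name == "item":
            if self.keep:
                self.matches.append("".join(self.buffer))
            self.buffer = None  # drop the buffer either way


def filter_items(xml_bytes, wanted):
    """Parse `xml_bytes` and return the serialized matching <item> elements."""
    handler = ItemFilter(set(wanted))
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches
```

For a file on disk, xml.sax.parse(path, handler) can be used in place of parseString so the input is read incrementally.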
Jun 30 '06 #2

I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

Jun 30 '06 #3

ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...


Can't help; I'm not a perl user, and I tend not to reinvent wheels
unless necessary.
Jun 30 '06 #4

ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

I think Joe Kesselman summarized your set of
options really comprehensively. Look at the
data and decide which kind of output you need.
You mentioned that (in case of a match), you
need the whole element. Do you need the element
exactly, with all possible sub-elements to
arbitrary depth?

If the tree hierarchy is rather flat, then you
could use a SAX-like parser, as described by Joe.
SAX-like parsers are available for most languages,
even Perl, bash, and gawk (which I prefer).
Jun 30 '06 #5

If it's a particularly huge file, I'd go with the buffered-SAX
semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
between SAX and DOM intended for this sort of chunk-at-a-time processing.)

Iterate through the document. For each item element, build an in-memory
subtree, check its <code>, output it if it's one you want, and then
discard it. This way you don't have to keep the whole source document in
memory at once. As a refinement, for even better efficiency, discard the
partly-built subtree (and ignore events until the item ends) as soon as
you see that the <code> isn't one you're looking for.
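
One way to sketch the build, check, output, discard loop is with Python's standard-library xml.etree.ElementTree.iterparse, which is a pull-style API loosely comparable to StAX. The function name extract_items and the <items> wrapper in the sample are assumptions made for this example. This is the simple variant: each <item> subtree is fully built before its <code> is checked, so it does not include the early-discard refinement.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <items> wrapper is an assumption.
SAMPLE = (
    b"<items>"
    b"<item><property2><code>A</code></property2></item>"
    b"<item><property2><code>B</code></property2></item>"
    b"</items>"
)


def extract_items(source, wanted):
    """Yield serialized <item> elements whose <code> text is in `wanted`.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        code = elem.findtext(".//code", default="").strip()
        if code in wanted:
            elem.tail = None  # do not serialize whitespace after </item>
            yield ET.tostring(elem, encoding="unicode")
        # Drop the subtree's children and text so memory use stays bounded.
        # The emptied <item> shell stays attached to the root element.
        elem.clear()
```

Passing a filename instead of a file object lets iterparse read the large file incrementally, and each yielded string can be written straight to the output file.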

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jul 1 '06 #6

ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

Don't. There are subtleties about the way in which XML is formed
which will conspire to bite you in the ass if you use a non-XML
language.

Using Perl with one of the several XML APIs is fine, of course.
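
To make a few of those subtleties concrete, here is a small Python sketch (standard library only) comparing a naive substring search with a real parser. The sample document and the helper names raw_text_hits and parsed_codes are invented for illustration. The same kind of code value appears once with an entity reference, once inside CDATA, and once inside a comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical document showing three ways plain-text matching goes wrong.
DOC = """<items>
  <item><property2><code>R&amp;D-42</code></property2></item>
  <item><property2><code><![CDATA[R&D-43]]></code></property2></item>
  <!-- <item><property2><code>R&amp;D-44</code></property2></item> -->
</items>"""


def raw_text_hits(text, needle):
    """Naive approach: look for the literal tag-plus-value string."""
    return f"<code>{needle}</code>" in text


def parsed_codes(text):
    """Parser approach: return the decoded text of every real <code> element."""
    return [code.text for code in ET.fromstring(text).iter("code")]
```

Here raw_text_hits(DOC, "R&D-42") is False because the file stores the ampersand as an entity, raw_text_hits(DOC, "R&D-43") is False because of the CDATA wrapper, and raw_text_hits(DOC, "R&amp;D-44") is True even though that item is commented out. The parser reports only the two real codes, already decoded.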

But I'm open to suggestions as to how most effectively to extract data
from this large file.

How large is large? XSLT runs pretty fast on a modern system, and what
you want to do isn't exactly rocket science (or if it is, I know any
number of unemployed rocket scientists who can do it for you :-)

This seems to do the job:

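<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml"/>

  <!-- Assumes the <item> elements are wrapped in a single <items> root;
       the root is recreated in the output. -->
  <xsl:template match="items">
    <items>
      <xsl:apply-templates/>
    </items>
  </xsl:template>

  <!-- Copy an <item> whole if its code contains the search string.
       contains() is a substring test; use property2/code = '...'
       for an exact match. -->
  <xsl:template match="item">
    <xsl:if test="contains(property2/code,'Match')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>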

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Jul 3 '06 #7
