473,574 Members | 3,082 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Search for string, then extract entire XML element where it appears. How?

I need to extract some elements from a very large XML file. Because of
the size, I'd like to work with it on my Linux machine as a text file.

Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

The XML document is comprised of a bunch of <item> elements:

<?xml version="1.0" encoding="UTF-8"?>
<item>
<property1>10 0</property1>
<property2>
<id>0</id>
<code>ThisIsThe StringINeedToMa tch</code>
</property2>
<keyword>
<value>value1 </value>
<value>value2 </value>
</keyword>
<color>
<type>21</type>
<shade>1</shade>
</color>
</item>

How would you approach this? I can write a script to find each code,
but I'm not sure how to then search forwards/backwards to extract the
DNA element.

Thanks!

M

Jun 30 '06 #1
6 2797
ma******@gmail. com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.
Jun 30 '06 #2
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

Joe Kesselman wrote:
ma******@gmail. com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.


Jun 30 '06 #3
ma******@gmail. com wrote:
I was hoping to just write a text parsing script using perl, for
example...


Can't help; I'm not a perl user, and I tend not to reinvent wheels
unless necessary.
Jun 30 '06 #4
ma******@gmail. com wrote:
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

I think Joe Kesselman summarized your set of
options really comprehensively . Look at the
data and decide which kind of output you need.
You mentioned that (in case of a match), you
need the whole element. Do you need the element
exactly, with all possible sub-elements to
arbitrary depth ?

If the tree hierarchy is rather flat, then you
could use a SAX-like parser, as describe by Joe.
SAX-like parsers are available for most languages,
even Perl, bash, and gawk (which I prefer).
Jun 30 '06 #5
If it's a particularly huge file, I'd go with the buffed-SAX
semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
between SAX and DOM intended for this sort of chunk-at-a-time processing.)

Iterate through the document. For each item element, build an in-memory,
check its <code>, output it if it's one you want, and discard it so.
This way you don't have to keep the whole source document in memory at
once. As a refinement, for even better efficiencly, optimize this by
discarding the partly-built subtree (and events until it ends) as soon
as you see that the <code> isn't one you're looking for.

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jul 1 '06 #6
ma******@gmail. com wrote:
I was hoping to just write a text parsing script using perl, for
example...
Don't. There are subtleties about the way in which XML is formed
which will conspire to bite you in the ass if you use a non-XML
language.

Using Perl with one of the several XML APIs is fine, of course.
But I'm open to suggestions as to how most effectively to extract data
from this large file.
How large is large? XSLT runs pretty fast on a modern system, and what
you want to do isn't exactly rocket science (or if it is, I know any
number of unemployed rocket scientists who can do it for you :-)

This seems to do the job:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:styleshe et xmlns:xsl="http ://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml"/>

<xsl:template match="items">
<items>
<xsl:apply-templates/>
</items>
</xsl:template>

<xsl:template match="item">
<xsl:if test="contains( property2/code,'Match')">
<xsl:copy-of select="."/>
</xsl:if>
</xsl:template>

</xsl:stylesheet>

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Jul 3 '06 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
2669
by: Ken Fine | last post by:
I'm looking to find or create an ASP script that will take a string, examine it for a search term, and if it finds the search term in the string, return the highlighted search term along with the words that surround it. In other words, I want the search term highlighted and shown in an excerpt of the context in which it appears. Any...
8
3342
by: horos | last post by:
hey all, Ok, a related question to my previous one on data dumpers for postscript. In the process of putting a form together, I'm using a lot of placeholder variables that I really don't care about in the submitted action. I'd therefore like to get rid of them by doing something like:
7
2204
by: Steven Bethard | last post by:
How do I make sure that my entire string was parsed when I call a pyparsing element's parseString method? Here's a dramatically simplified version of my problem: py> import pyparsing as pp py> match = pp.Word(pp.nums) py> def parse_num(s, loc, toks): .... n, = toks .... return int(n) + 10 ....
52
3013
by: junky_fellow | last post by:
char *str1 = "Hello"; char arr1 = { "Hello" }; char arr2 = { 'H', 'e', 'l', 'l', 'o' }; Is it legal to modify str1, arr1 and arr2 ?
3
1986
by: Alex | last post by:
Hello. First, with AJAX I will get a remote web page into a string. Thus, a string will contain HTML tags and such. I will need to extract text from one <span> for which I know the ID the inner text. Is it possible to access in this way "string variable".getElementByID() somehow? Thank you.
0
1555
by: delphiconsultingguy | last post by:
Hi all, Spent WAAAYYY too much time trying to figure this out because there's not many good examples out there, so in the interest of sparing y'all from suff'rin same, I've pasted it into eternity for you. Works like a charm. (I know, I know, I love you too) Sean
3
1492
by: mdh_2972 | last post by:
I have an array of over 1000 links in a .JS file. I do not want to put the whole thing on my page because it would take to long to render the page. So how can I randomly pick 1 element from the array and then have the browser just place the 1 element only on my webpage. Also I also would like to know how to search for text in the array...
1
1859
by: Nitinkcv | last post by:
Hi, I have a textbox and a button. In my textbox i have to enter the query string(say shoes) and on clicking the button takes me to a page show all item related to the search string( in this case shoes). But on mixing the search string with wildcards it displays that no items could be found. For eg: for search string s@h^o$e@s it would go to...
5
3029
by: MJK | last post by:
Suppose I have the following function in my program: void ExtractData(Ind *AM) { int i,j; char str; char c; FILE *ext=fopen("test.out","r"); //suppose I have N line each with M digits (in here M=5) like: 1 2 2
0
7842
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7764
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
8110
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
8273
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
1
7862
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
1
5658
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3794
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2277
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1375
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.