Search for string, then extract entire XML element where it appears. How?

I need to extract some elements from a very large XML file. Because of
the size, I'd like to work with it on my Linux machine as a text file.

Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

The XML document consists of a series of <item> elements:

<?xml version="1.0" encoding="UTF-8"?>
<item>
  <property1>100</property1>
  <property2>
    <id>0</id>
    <code>ThisIsTheStringINeedToMatch</code>
  </property2>
  <keyword>
    <value>value1</value>
    <value>value2</value>
  </keyword>
  <color>
    <type>21</type>
    <shade>1</shade>
  </color>
</item>

How would you approach this? I can write a script to find each code,
but I'm not sure how to then search forwards/backwards to extract the
enclosing <item> element.

Thanks!

M

Jun 30 '06 #1
ma******@gmail.com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.
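
As a rough illustration of the DOM route described above, here is a minimal Python sketch using the standard library's xml.dom.minidom. The function names (enclosing_item, extract_items), the <items> wrapper element and the sample codes are assumptions invented for the example, not anything from the original file. Note that a DOM loads the whole document into memory, so this only suits files that fit comfortably in RAM.

```python
from xml.dom import minidom

# Hypothetical sample data; the <items> root is an assumption.
SAMPLE = """<items>
<item><property2><id>0</id><code>ABC123</code></property2></item>
<item><property2><id>1</id><code>XYZ789</code></property2></item>
</items>"""


def enclosing_item(node):
    """Walk up the parentNode links until an <item> element is found."""
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "item":
            return node
        node = node.parentNode
    return None


def extract_items(xml_text, wanted):
    """Return the serialized <item> elements whose <code> text is in `wanted`."""
    doc = minidom.parseString(xml_text)
    found = []
    for code in doc.getElementsByTagName("code"):
        # Join the text and CDATA children of <code>, ignoring stray whitespace.
        text = "".join(
            child.data
            for child in code.childNodes
            if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
        ).strip()
        if text in wanted:
            item = enclosing_item(code)
            if item is not None:
                found.append(item.toxml())
    return found


matches = extract_items(SAMPLE, {"XYZ789"})
print(matches[0])
```

Passing the search strings as a set keeps each lookup cheap when the list of codes is long.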
Jun 30 '06 #2
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

Jun 30 '06 #3
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...


Can't help; I'm not a perl user, and I tend not to reinvent wheels
unless necessary.
Jun 30 '06 #4
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

I think Joe Kesselman summarized your set of
options really comprehensively. Look at the
data and decide which kind of output you need.
You mentioned that (in case of a match), you
need the whole element. Do you need the element
exactly, with all possible sub-elements to
arbitrary depth?

If the tree hierarchy is rather flat, then you
could use a SAX-like parser, as described by Joe.
SAX-like parsers are available for most languages,
even Perl, bash, and gawk (which I prefer).
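
To make the SAX-with-buffering idea concrete, here is a small sketch in Python's standard xml.sax module (chosen only for illustration; the same shape applies in other SAX-like parsers). The class name ItemGrabber, the <items> wrapper and the sample codes are made up for the example, and it assumes <item> elements are never nested inside one another.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class ItemGrabber(xml.sax.ContentHandler):
    """Buffer each <item> as text and keep it only if its <code> is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.buffer = None      # string pieces collected while inside an <item>
        self.code_text = None   # text pieces collected while inside a <code>
        self.keep = False
        self.results = []

    def startElement(self, name, attrs):
        if name == "item":
            self.buffer, self.keep = [], False
        if self.buffer is not None:
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self.buffer.append(f"<{name}{attr_text}>")
            if name == "code":
                self.code_text = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(escape(content))
        if self.code_text is not None:
            self.code_text.append(content)

    def endElement(self, name):
        if self.buffer is None:
            return
        self.buffer.append(f"</{name}>")
        if name == "code":
            if "".join(self.code_text).strip() in self.wanted:
                self.keep = True
            self.code_text = None
        elif name == "item":
            if self.keep:
                self.results.append("".join(self.buffer))
            self.buffer = None  # discard the buffered item either way


# Hypothetical sample data; the <items> root is an assumption.
SAMPLE = b"""<items>
<item><property2><id>0</id><code>ABC123</code></property2></item>
<item><property2><id>1</id><code>XYZ789</code></property2></item>
</items>"""

handler = ItemGrabber({"XYZ789"})
xml.sax.parseString(SAMPLE, handler)
print(handler.results)
```

Only one <item> is held in the buffer at a time, so memory use depends on the largest item rather than on the size of the file. For a file on disk, xml.sax.parse(filename, handler) takes the place of parseString.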
Jun 30 '06 #5
If it's a particularly huge file, I'd go with the buffered-SAX
semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
between SAX and DOM intended for this sort of chunk-at-a-time processing.)

Iterate through the document. For each item element, build an in-memory
subtree, check its <code>, output it if it's one you want, and then
discard it. This way you don't have to keep the whole source document in
memory at once. As a refinement, for even better efficiency, discard the
partly-built subtree (and ignore events until it ends) as soon as you see
that the <code> isn't one you're looking for.
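
The build-check-discard loop described above can be sketched with Python's xml.etree.ElementTree.iterparse, which is a pull-style API in roughly the spirit of StAX. This is only one possible rendering: the function name matching_items, the <items> wrapper and the sample codes are assumptions, and the early-discard refinement is not implemented here (each item is fully built before it is checked).

```python
import io
import xml.etree.ElementTree as ET


def matching_items(source, wanted):
    """Yield serialized <item> elements whose property2/code text is in `wanted`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        code = elem.findtext("property2/code", default="").strip()
        if code in wanted:
            yield ET.tostring(elem, encoding="unicode").strip()
        # Drop the item's children and text so memory use stays bounded.
        elem.clear()


# Hypothetical sample data; the <items> root is an assumption.
SAMPLE = b"""<items>
<item><property2><id>0</id><code>ABC123</code></property2></item>
<item><property2><id>1</id><code>XYZ789</code></property2></item>
</items>"""

results = list(matching_items(io.BytesIO(SAMPLE), {"XYZ789"}))
print(results[0])
```

One caveat: elem.clear() empties each item, but the emptied elements remain attached to the root, so a file with many millions of items would still accumulate some memory. iterparse also accepts a filename in place of the BytesIO object used here.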

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jul 1 '06 #6
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

Don't. There are subtleties about the way in which XML is formed
which will conspire to bite you in the ass if you use a non-XML
language.

Using Perl with one of the several XML APIs is fine, of course.
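
One concrete example of those subtleties, sketched in Python with made-up data: the same character content can be written in several ways (entity references, CDATA sections, numeric character references), so a plain substring search on the raw text finds only some of them, while an XML parser sees them all as the same string.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same logical content: the code "A&B<1".
variants = [
    "<item><code>A&amp;B&lt;1</code></item>",
    "<item><code><![CDATA[A&B<1]]></code></item>",
    "<item><code>A&#38;B&#x3C;1</code></item>",
]

needle = "A&B<1"

# Naive text search on the raw markup.
text_hits = [needle in v for v in variants]

# Comparison after parsing, which resolves entities and CDATA.
parsed_hits = [ET.fromstring(v).findtext("code") == needle for v in variants]

print(text_hits)    # [False, True, False]
print(parsed_hits)  # [True, True, True]
```

Attribute quoting, whitespace inside tags, comments and character encodings cause similar surprises for line-oriented text tools.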

But I'm open to suggestions as to how most effectively to extract data
from this large file.

How large is large? XSLT runs pretty fast on a modern system, and what
you want to do isn't exactly rocket science (or if it is, I know any
number of unemployed rocket scientists who can do it for you :-)

This seems to do the job:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml"/>

  <!-- Re-create the <items> wrapper around whatever matches -->
  <xsl:template match="items">
    <items>
      <xsl:apply-templates/>
    </items>
  </xsl:template>

  <!-- Copy an <item> whole when its code contains the search string -->
  <xsl:template match="item">
    <xsl:if test="contains(property2/code,'Match')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Jul 3 '06 #7
