Search for string, then extract entire XML element where it appears. How?

I need to extract some elements from a very large XML file. Because of
the size, I'd like to work with it on my Linux machine as a text file.

Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

The XML document is comprised of a bunch of <item> elements:

<?xml version="1.0" encoding="UTF-8"?>
<item>
  <property1>100</property1>
  <property2>
    <id>0</id>
    <code>ThisIsTheStringINeedToMatch</code>
  </property2>
  <keyword>
    <value>value1</value>
    <value>value2</value>
  </keyword>
  <color>
    <type>21</type>
    <shade>1</shade>
  </color>
</item>

How would you approach this? I can write a script to find each code,
but I'm not sure how to then search forwards and backwards from the
match to extract the enclosing <item> element.

Thanks!

M

Jun 30 '06 #1
ma******@gmail.com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.
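
A minimal sketch of the DOM approach, written in Python with the standard-library xml.dom.minidom. The <items> wrapper, the sample codes ABC and XYZ, and the helper names enclosing_item and items_with_code are assumptions made up for illustration; they are not taken from the original file. Note that minidom loads the whole document into memory, so this may not suit a very large input.

```python
from xml.dom import minidom

# Hypothetical sample data: two <item> elements under an assumed <items> root.
SAMPLE = (
    "<items>"
    "<item><property2><id>0</id><code>ABC</code></property2></item>"
    "<item><property2><id>1</id><code>XYZ</code></property2></item>"
    "</items>"
)


def enclosing_item(node):
    """Walk up the parentNode links until an <item> element is found."""
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "item":
            return node
        node = node.parentNode
    return None


def items_with_code(xml_text, wanted):
    """Return the serialized <item> elements whose <code> text is in `wanted`."""
    doc = minidom.parseString(xml_text)
    found = []
    for code in doc.getElementsByTagName("code"):
        # Join the text-node children to get the element's text content.
        text = "".join(
            n.data for n in code.childNodes if n.nodeType == n.TEXT_NODE
        )
        if text.strip() in wanted:
            item = enclosing_item(code)
            if item is not None:
                found.append(item.toxml())
    return found


matches = items_with_code(SAMPLE, {"XYZ"})
print(matches)
```

Passing the search strings as a set keeps the membership test cheap when the list of codes is long.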

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.
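
The buffering idea can be sketched with Python's standard-library xml.sax. This is only one possible shape for it: the class name ItemFilter, the <items> wrapper, and the sample codes are assumptions for illustration, and the handler assumes <item> elements never nest inside one another. Each item is re-serialized into a buffer as its events arrive, and the buffer is kept only if the item's <code> turned out to be wanted.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample data: two <item> elements under an assumed <items> root.
SAMPLE = (
    "<items>"
    "<item><property2><id>0</id><code>ABC</code></property2></item>"
    "<item><property2><id>1</id><code>XYZ</code></property2></item>"
    "</items>"
)


class ItemFilter(xml.sax.ContentHandler):
    """Buffer each <item> and keep it only if its <code> is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.buffer = None    # list of markup fragments while inside <item>
        self.in_code = False  # True between <code> and </code>
        self.code_text = []
        self.keep = False
        self.matches = []

    def startElement(self, name, attrs):
        if name == "item" and self.buffer is None:
            self.buffer, self.keep = [], False
        if self.buffer is not None:
            attr_text = "".join(
                f" {key}={quoteattr(value)}" for key, value in attrs.items()
            )
            self.buffer.append(f"<{name}{attr_text}>")
            if name == "code":
                self.in_code, self.code_text = True, []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(escape(content))
            if self.in_code:
                self.code_text.append(content)

    def endElement(self, name):
        if self.buffer is None:
            return
        self.buffer.append(f"</{name}>")
        if name == "code":
            self.in_code = False
            if "".join(self.code_text).strip() in self.wanted:
                self.keep = True
        elif name == "item":
            if self.keep:
                self.matches.append("".join(self.buffer))
            self.buffer = None  # discard the buffer either way


handler = ItemFilter({"XYZ"})
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.matches)
```

For a file on disk, xml.sax.parse(path, handler) would take the place of parseString, so only one item's worth of markup is held at a time.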
Jun 30 '06 #2
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

Joe Kesselman wrote:
ma******@gmail.com wrote:
Basically, I am going to have a list of specific strings I'm searching
for. For each string, I need to search through the XML file, and when
I find that string (in the tag <code>), copy the entire <item> XML
element that the code appears in, into another text file.

How would you approach this?


Using which tool?

In XPath, including XSLT, use ancestor::item to find the enclosing item
element.

If you're operating on the DOM APIs, simply iterate your way up the
parents looking for that item element... or use the filtered traversal
mechanisms, if your DOM supports them.

If you're working in SAX... SAX can't run backward, so it's up to you to
do some sort of buffering so you can re-scan once you recognize the item
as being one you're interested in.


Jun 30 '06 #3
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...


Can't help; I'm not a perl user, and I tend not to reinvent wheels
unless necessary.
Jun 30 '06 #4
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

But I'm open to suggestions as to how most effectively to extract data
from this large file.

I think Joe Kesselman summarized your set of
options really comprehensively. Look at the
data and decide which kind of output you need.
You mentioned that (in case of a match), you
need the whole element. Do you need the element
exactly, with all possible sub-elements to
arbitrary depth?

If the tree hierarchy is rather flat, then you
could use a SAX-like parser, as described by Joe.
SAX-like parsers are available for most languages,
even Perl, bash, and gawk (which I prefer).
Jun 30 '06 #5
If it's a particularly huge file, I'd go with the buffered-SAX
semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
between SAX and DOM intended for this sort of chunk-at-a-time processing.)

Iterate through the document. For each item element, build an in-memory
subtree, check its <code>, output it if it's one you want, and then
discard it. This way you don't have to keep the whole source document in
memory at once. As a refinement, for even better efficiency, discard the
partly-built subtree (and ignore events until it ends) as soon as you
see that the <code> isn't one you're looking for.
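
A rough sketch of the chunk-at-a-time loop, using Python's standard-library xml.etree.ElementTree.iterparse as a stand-in for the buffered-SAX or StAX approach. The function name extract_items, the <items> wrapper, and the sample codes are assumptions for illustration. The sketch does not implement the early-discard refinement; it builds each item fully before checking it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: two <item> elements under an assumed <items> root.
SAMPLE = (
    "<items>"
    "<item><property2><id>0</id><code>ABC</code></property2></item>"
    "<item><property2><id>1</id><code>XYZ</code></property2></item>"
    "</items>"
)


def extract_items(source, wanted):
    """Yield serialized <item> elements whose <code> text is in `wanted`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        code = elem.findtext(".//code", default="").strip()
        if code in wanted:
            elem.tail = None  # keep trailing whitespace out of the output
            yield ET.tostring(elem, encoding="unicode")
        # Free this item's children and text; the emptied element itself
        # stays attached to the root.
        elem.clear()


matches = list(extract_items(io.BytesIO(SAMPLE.encode("utf-8")), {"XYZ"}))
print(matches)
```

For a real file, an open binary file object or a filename can be passed as `source`, and each yielded string can be written straight to the output file.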

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jul 1 '06 #6
ma******@gmail.com wrote:
I was hoping to just write a text parsing script using perl, for
example...

Don't. There are subtleties about the way in which XML is formed
which will conspire to bite you in the ass if you use a
non-XML-aware language.

Using Perl with one of the several XML APIs is fine, of course.

But I'm open to suggestions as to how most effectively to extract data
from this large file.

How large is large? XSLT runs pretty fast on a modern system, and what
you want to do isn't exactly rocket science (or if it is, I know any
number of unemployed rocket scientists who can do it for you :-)

This seems to do the job:

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml"/>

  <!-- Re-create the root element and process its children -->
  <xsl:template match="items">
    <items>
      <xsl:apply-templates/>
    </items>
  </xsl:template>

  <!-- Copy an item whole when its code contains the search string -->
  <xsl:template match="item">
    <xsl:if test="contains(property2/code,'Match')">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

///Peter
--
XML FAQ: http://xml.silmaril.ie/
Jul 3 '06 #7