How to search HUGE XML with DOM?

A relational database searches efficiently even when the database is
very big (several thousand or tens of thousands of records). But my
current project is based on XML, because its tree-like data structure is
much more flexible, and on DOM, which can be manipulated just like a
tree. How can I set up such an XML database for searching when it
contains 10,000 records or more (one record usually contains 10 to 30
tags)?

My search needs:
1. Search for and return every record (an element) with a specific id.
2. Search for and return every record whose child nodes have a specific
id or attribute.

The xml.dom.minidom module is too slow at parsing such a big XML file
into a DOM object, while pulldom takes quite a long time to go
through the whole database file. How can I speed up the search?
Is there an existing solution or algorithm? Thank you for your
suggestions.

Mar 31 '06 #1
> the xml.dom.minidom object is too slow when parsing such a big XML file
> to a DOM object. while pulldom should spend quite a long time going
> through the whole database file. How to enhance the searching speed?
> Are there existing solution or algorithm? Thank you for your
> suggestion...


I've told you that before, and I tell you again: RDBMS is the way to go.
There might be XML parsers that work faster - I suppose cElementTree can
gain you some speed - but ultimately the problems are inherent in the
representation as DOM: no type information, no indices, no nothing. Just a
huge pile of nodes in memory.

So all searches are linear in the number of nodes. Of course you might be
able to create indices yourself, even devise a clever scheme to make using
them as declarative as possible. But that would in the end mean nothing but
re-creating RDBMS technology - why do that, if it's already there?
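
The "create indices yourself" idea can be sketched roughly as below. This is only an illustration under assumptions that are not in the thread: records are taken to be `<record>` elements carrying an `id` attribute, and `SAMPLE` and `build_indices` are made-up names. It uses the standard-library xml.etree.ElementTree, which in current Python versions picks up the C accelerator automatically.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; the real record layout is not given in the thread.
SAMPLE = """<db>
  <record id="r1"><item id="a"/><item id="b" kind="x"/></record>
  <record id="r2"><item id="a"/></record>
</db>"""


def build_indices(root):
    """Walk the tree once and build two dictionaries for constant-time lookups."""
    by_id = {}                        # record id -> record element
    by_child_id = defaultdict(list)   # child id  -> records containing such a child
    for record in root.iter("record"):
        by_id[record.get("id")] = record
        for child in record:
            child_id = child.get("id")
            if child_id is not None:
                by_child_id[child_id].append(record)
    return by_id, by_child_id


root = ET.fromstring(SAMPLE)
by_id, by_child_id = build_indices(root)

# Search need 1: the record with a specific id.
record_r2 = by_id["r2"]
# Search need 2: all records that have a child with a specific id.
records_with_a = [r.get("id") for r in by_child_id["a"]]
```

Building the dictionaries is still one linear pass, but every later lookup avoids rescanning the nodes. The indices have to be rebuilt or updated by hand whenever the tree changes, which is the maintenance cost described above.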

Maybe there are frameworks out there that support you in this, but the very
nature of XML surely makes that a more tedious task than just defining a
simple SQL schema. If I had to search for some XML tools that go beyond
DOM, I'd go for Uche Ogbuji's 4Suite as a starter and work my way down from
there - maybe Amara is what you need?

Now having said that: I'm not a SQL-bigot. Just use the right tool for the
job.

Regards,

Diez
Mar 31 '06 #2
Sullivan WxPyQtKinter wrote:
> a relation database has admiring search efficiency when the database is
> very big (several thousands or tens of thousands of records). But my
> current project is based on XML, for its tree-like data structure has
> much more flexibility; and DOM, which could be manipulated just like a
> tree. However, how to establish such a XML data base for search when it
> contains 10,000 records (One record usually contain 10~30 tags) or
> more?
>
> My search needs:
> 1. Search and return all the record (an element) with specific id.
> 2. Search and return all the record whose child nodes has a specific id
> or attribute.
>
> the xml.dom.minidom object is too slow when parsing such a big XML file
> to a DOM object. while pulldom should spend quite a long time going
> through the whole database file. How to enhance the searching speed?
> Are there existing solution or algorithm? Thank you for your
> suggestion...


- have a look at cElementTree?
- store your XML as persistent objects in a ZODB instance, then use the ZODB
catalog for queries?
- index relevant data in a DB (RDBMS, Berkeley, whatever...)?
- have a look at 4Suite (http://4suite.org/index.xhtml)?
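
For the first suggestion, a rough sketch of a streaming search is shown here. It assumes `<record>` elements with an `id` attribute (hypothetical names, as are `SAMPLE` and `find_records`) and uses `iterparse` from xml.etree.ElementTree, the standard-library home of the ElementTree API, so the whole file never has to sit in memory as a tree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; a real run would pass a file name or an open file instead.
SAMPLE = (b'<db><record id="r1"><item id="a"/></record>'
          b'<record id="r2"><item id="b"/><note/></record></db>')


def find_records(source, wanted_id, record_tag="record"):
    """Stream through the document and collect a summary of matching records."""
    matches = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        if elem.get("id") == wanted_id:
            # Copy out what is needed before the element is cleared.
            matches.append({"id": elem.get("id"),
                            "children": [child.tag for child in elem]})
        # Drop the record's attributes and children to keep memory use low.
        elem.clear()
    return matches


matches = find_records(io.BytesIO(SAMPLE), "r2")
```

This is still a linear scan of the file, so it helps with memory and parse speed rather than with lookup time; repeated queries would benefit more from one of the indexing options in the list.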

My 2 cents...
--
bruno desthuilliers
python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
p in 'o****@xiludom. gro'.split('@')])"
Mar 31 '06 #3
Diez B. Roggisch wrote:
>> the xml.dom.minidom object is too slow when parsing such a big XML file
>> to a DOM object. while pulldom should spend quite a long time going
>> through the whole database file. How to enhance the searching speed?
>> Are there existing solution or algorithm? Thank you for your
>> suggestion...
>
> I've told you that before, and I tell you again: RDBMS is the way to go.

We've lost some context from the original post that may be relevant
here, but if populating what the original questioner calls "the
database" is an infrequent operation, then an RDBMS probably is the way
to go, in general. On the other hand, if a lot of parsing has to happen
in order to perform a search, such parsing would probably incur a lot
of overhead from SQL inserts that wouldn't be particularly desirable.

> There might be XML-parsers that work faster - I suppose cElementTree can
> gain you some speed - but ultimately the problems are inherent in the
> representation as DOM: no type-information, no indices, no nothing. Just a
> huge pile of nodes in memory.

Well, I would hope that W3C DOM operations like getElementById would be
supported by some index in the implementation: that would make some of
the searches mentioned by the questioner fairly rapid, given enough
memory.

> So all searches are linear in the number of nodes. Of course you might be
> able to create indices yourself, even devise a clever scheme to make using
> them as declarative as possible. But that would in the end mean nothing but
> re-creating RDBMS technology - why do that, if it's already there?


I agree that careful usage of RDBMS technology would solve the general
problems of searching large amounts of data, but the stated queries
should involve indexes and be fairly quick.

Paul

Mar 31 '06 #4
Mind that XML documents are not more flexible than an RDBMS.

You can represent any XML document in an RDBMS. You cannot represent any
RDBMS in an XML document. RDBMS are (strictly speaking) relations and XML
documents are trees. Relations are superior to trees, at least
mathematically speaking.

Once you have set up your system in a practicable way (e.g. not needing
to create a new table via SQL queries for each new type of node, which
would be a pain), SQL is far superior to XML.
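
One possible reading of "a practicable way" is a generic node table plus an attribute table, so that no new table is needed per node type. The sketch below is an assumption, not something stated in the thread: the schema, the table names, the sample data, and the helpers `shred` and `parents_of_children_with` are all made up for illustration, using the standard-library sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical generic schema: one row per element, one row per attribute.
SCHEMA = """
CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT);
CREATE TABLE attr (node INTEGER REFERENCES node(id), name TEXT, value TEXT);
CREATE INDEX attr_lookup ON attr (name, value);
"""

SAMPLE = ('<db><record id="r1"><item id="a"/><item id="b"/></record>'
          '<record id="r2"><item id="a"/></record></db>')


def shred(conn, root):
    """Store every element and attribute of the tree in the two tables."""
    conn.executescript(SCHEMA)
    # Number the elements in document order and remember each element's parent.
    ids = {elem: n for n, elem in enumerate(root.iter(), start=1)}
    parent_of = {child: parent for parent in root.iter() for child in parent}
    with conn:  # one transaction for the whole load
        for elem, node_id in ids.items():
            parent = parent_of.get(elem)
            conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                         (node_id, ids.get(parent), elem.tag, elem.text))
            conn.executemany("INSERT INTO attr VALUES (?, ?, ?)",
                             [(node_id, k, v) for k, v in elem.attrib.items()])


def parents_of_children_with(conn, name, value):
    """Return (id, tag) of every node that has a child with the given attribute."""
    rows = conn.execute(
        "SELECT DISTINCT p.id, p.tag FROM attr a "
        "JOIN node c ON c.id = a.node "
        "JOIN node p ON p.id = c.parent "
        "WHERE a.name = ? AND a.value = ? ORDER BY p.id",
        (name, value))
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
shred(conn, ET.fromstring(SAMPLE))
hits = parents_of_children_with(conn, "id", "a")
```

The index on `(name, value)` is what lets the attribute lookup avoid a full scan; the trade-off of such a generic schema is that queries need joins to reassemble a record.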

Anyway, cElementTree seems to be the best way to go for you now. Its
performance is unmatched by any other Python XML library, as far as I
know.

Mar 31 '06 #5

On 31-Mar-06, at 11:17 AM, bayerj wrote:
> Mind, that XML documents are not more flexible than RDBMS.
>
> You can represent any XML document in a RDBMS. You cannot represent any
> RDBMS in an XML document. RDBMS are (strictly spoken) relations and XML
> documents are trees. Relations are superior to trees, at least
> mathematically speaking.
>
> Once you have set up your system in a practicable way (e.G. not needing
> to create a new table via SQL Queries for a new type of node, which
> would be a pain) SQL is far superior to XML.
>
> Anyway, cElementTree seems to be the best way to go for you now. Its
> performance is untopped by any other python xml library, as far as I
> know.



If I may hijack this thread for a bit, I'd like to dig deeper into
this issue :)

Currently my simulation program produces an XML log file with events
represented as nodes. Often those files grow to multiple GB in size. I
like this setup because the format is open and easily parseable with a
variety of tools, so I have a bunch of scripts that can analyze
different aspects of the simulation.

I don't have much of a clue about databases, except that they exist,
are somewhat complex, and often use proprietary formats for efficiency.
So any pointers on whether an RDBMS-based setup would be better would
be greatly appreciated.

Even trivial aspects, such as whether to produce the RDBMS during the
simulation or to convert the complete XML log file into one, are not
entirely clear to me. I gather that an RDBMS would be much better suited
for analysis, but what about portability? Is a database file a
separate entity that may be passed around?

Apologies if this seems like a selfish question; perhaps consider it
a full disclosure. Different set-ups/examples would be appreciated as
well.

--
Cheers, Ivan

Mar 31 '06 #6
Perhaps what you have said is correct. But in my view XML is more direct
for programmers and readers.

bayerj wrote:
> Mind, that XML documents are not more flexible than RDBMS.
>
> You can represent any XML document in a RDBMS. You cannot represent any
> RDBMS in an XML document. RDBMS are (strictly spoken) relations and XML
> documents are trees. Relations are superior to trees, at least
> mathematically speaking.
>
> Once you have set up your system in a practicable way (e.G. not needing
> to create a new table via SQL Queries for a new type of node, which
> would be a pain) SQL is far superior to XML.
>
> Anyway, cElementTree seems to be the best way to go for you now. Its
> performance is untopped by any other python xml library, as far as I
> know.


Apr 1 '06 #7
Ivan Vinogradov wrote:
> I have not much clue about databases, except that they exist, somewhat
> complex, and often use proprietary formats for efficiency.

Proprietary storage format, but a standardized API...

> So any points on whether RDBM-based setup
> would be better would be greatly appreciated.

The typical use case for an RDBMS is that you have a number
of record types (classes/relations/tables) with a regular
structure, and all data fits into these structures. When
you want to search for something, you know exactly in what
field of what table to look (but not which item, of course).
You also typically have multiple users who need to be able
to update the same database simultaneously without getting
in each other's way.

> Even trivial aspects, such as whether to produce RDBM during the
> simulation, or convert the complete XML log file into one, are not
> entirely clear to me.

Most databases are suited to writing data in fairly small chunks,
although it's typically much faster to write 100 items in one
transaction than to write 100 transactions with one item each.
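
The point about batching writes can be sketched with the standard-library sqlite3 module. SQLite, the `event` table, its columns, and the helper `store_events` are assumptions chosen for illustration; the thread does not name a particular database or schema.

```python
import sqlite3


def store_events(conn, events):
    """Insert a batch of (id, kind, time) tuples inside a single transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event "
        "(id TEXT PRIMARY KEY, kind TEXT, time REAL)")
    # 'with conn' commits once at the end, or rolls back if an insert fails.
    with conn:
        conn.executemany(
            "INSERT INTO event (id, kind, time) VALUES (?, ?, ?)", events)


conn = sqlite3.connect(":memory:")  # a file path would make the data persistent
events = [(f"e{i}", "tick", i * 0.5) for i in range(100)]
store_events(conn, events)

count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
# The primary key gives an indexed lookup by id.
kind = conn.execute("SELECT kind FROM event WHERE id = ?", ("e42",)).fetchone()[0]
```

A simulation could call something like `store_events` every few hundred events instead of committing each event separately.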
> I gather that RDBM would be much better suited for
> analysis, but what about portability ? Is database file a separate
> entity that may be passed around?


Who says that a database needs to reside in a file? Most
databases reside on disk, but it might well be in raw
partitions.

In general, you should see the database as a persistent
representation of data in a system. It's not a transport
mechanism.
Apr 7 '06 #8
Sullivan WxPyQtKinter wrote:
> My search needs:
> 1. Search and return all the record (an element) with specific id.
> 2. Search and return all the record whose child nodes has a specific id
> or attribute.


Try lxml, which is based on the libxml2 library. The current SVN version has
support for xml:id through the XMLDTDID function. It simply returns an XML
tree and an ID dictionary.
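
A small sketch of what that looks like, assuming lxml is installed and that the document declares its ID attributes in a DTD. The inline DTD, the element names, and the sample data are made up for illustration.

```python
from lxml import etree

# Hypothetical document; the internal DTD subset declares 'id' as an ID attribute.
DOC = b"""<!DOCTYPE db [
  <!ELEMENT db (record*)>
  <!ELEMENT record EMPTY>
  <!ATTLIST record id ID #REQUIRED>
]>
<db><record id="r1"/><record id="r2"/></db>"""

# XMLDTDID parses the text and returns the root element plus an ID dictionary.
root, ids = etree.XMLDTDID(DOC)

# Dictionary lookup by ID value instead of walking the tree.
record = ids["r2"]
```

The dictionary is keyed by the values of attributes the DTD declares as IDs, so it covers the first search need directly; searching by arbitrary child attributes would still need XPath or a separate index.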

http://codespeak.net/lxml

Stefan
Apr 25 '06 #9
