Highly optimized business-rule validation of very large XML documents possible?

Related to another topic I just posted, I wanted to discuss ways to optimize
the validation of very large (>100MB) XML documents.

First, I have no idea if something like this already exists; it may even be
the typical implementation for all I know.

At any rate, it occurs to me that the set of business rules being
validated against an XML document involves only a limited set of nodes at
any given time (while parsing through the document). For example, if there
is a parent<->child node dependency, then only the pertinent information
related to those nodes needs to be kept in memory. Once the dependency has
been resolved (by validating the rule), the memory associated with those
nodes could be freed. In this way, large documents could be validated
efficiently, by storing only the information related to unresolved
dependencies and freeing it as soon as each dependency is resolved.

I don't have a lot of practical XML experience. But I've read, for example,
that using a SAX parser can be difficult in cases where you need to maintain
a lot of "state" information. So, what I'm asking is whether there is a
general solution to this problem, rather than having application-specific
code to handle the "state" of dependencies?
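
To make the idea concrete, here is a minimal SAX sketch (the element names
and the rule are made up for illustration) that keeps only the state for one
hypothetical parent<->child rule, "every <order> must contain at least one
<item>", and drops that state the moment the closing tag resolves the rule:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: assumes <order> elements do not nest.
public class OrderRuleHandler extends DefaultHandler {
    private boolean inOrder = false;  // pending-dependency state: two booleans, nothing more
    private boolean sawItem = false;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("order".equals(qName)) {
            inOrder = true;           // a dependency opens here
            sawItem = false;
        } else if (inOrder && "item".equals(qName)) {
            sawItem = true;           // the dependency is satisfied
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("order".equals(qName)) {
            if (!sawItem) {
                System.err.println("rule violated: <order> without <item>");
            }
            inOrder = false;          // dependency resolved: the state is released here
        }
    }
}

The two booleans are the entire memory footprint of that rule, no matter
how large the document is.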

It seems to me that rule dependencies could be represented by a "graph",
similar in some ways to the reference graph traced by the Java garbage
collector. And, like the garbage collector, memory could be freed once
there are no more "references" to a particular dependency. The
dependencies themselves would be something like "threads" that connect
nodes. Longer threads would require more memory. Further optimization
might be achieved by determining whether the dependency threads are better
suited to a depth-first or breadth-first traversal, or some combination.

In my other post, I ask whether XML Schema can be used to validate rules
like this, or whether there are other solutions. In the context of this
post: does XML Schema, or any other method, support any of the concepts I
describe above?

If XML Schema is unable to handle rules like this, and there is no other
available solution, does it make sense that something based on XPath might
work? I'm wondering if XPath expressions could be used to represent the
dependencies (i.e., what to keep in memory), while something else actually
evaluates each dependency.

Thanks for any help/suggestions/comments,

Mike


Jul 20 '05 #1
In article <IP********************@giganews.com>,
Mike <mm********@yahoo.com> wrote:

% I don't have a lot of practical XML experience. But I've read, for example,
% that using a SAX parser can be difficult in cases where you need to maintain
% a lot of "state" information. So, what I'm asking is whether there is a

It's not difficult; it's just that you have to maintain the state information.
The parser won't do it for you.

% general solution to this problem, rather than having application-specific
% code to handle the "state" of dependencies?

There are SAX parsers which validate against DTDs. They presumably have
general ways of doing this without saving unneeded nodes. If you find
a SAX parser which validates against some other schema mechanism, it
probably does so reasonably memory-efficiently. I don't think there's
anything in any of the common schema languages which precludes this.
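
For instance, here's a bare-bones JAXP sketch of turning on DTD validation
in a SAX parse (the file name is a placeholder):

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ValidatingParse {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);     // validate against the DTD the document declares
        SAXParser parser = factory.newSAXParser();
        parser.parse("big-document.xml", new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e;                 // surface validity errors instead of ignoring them
            }
        });
    }
}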

% If XML Schema is unable to handle rules like this, and there is no other
% available solution, does it make sense that something based on XPath might
% work?

XPath more-or-less implies that you build a tree of some sort in memory,
so it works against what you want.
--

Patrick TJ McPhee
East York Canada
pt**@interlog.com
Jul 20 '05 #2
Patrick TJ McPhee, in message (comp.text.xml:58709), wrote:
> There are SAX parsers which validate against DTDs. They presumably have
> general ways of doing this without saving unneeded nodes. If you find
> a SAX parser which validates against some other schema mechanism, it
> probably does so reasonably memory-efficiently. I don't think there's
> anything in any of the common schema languages which precludes this.
DTD is particularly easy to validate on a SAX stream, because of the
determinism condition on regular expressions (and the ID/IDREF constraints
are completely orthogonal, and also easy to check). Concerning XML Schema,
it is slightly more complex, but still, AFAICT, you can implement
validation without using complex tree automata (because the regular
expressions are actually regular expressions over tag names, not types). I
guess the memory complexity is linear in the depth of the document.
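
A rough sketch of the kind of checker I mean (the ContentModel interface
stands in for a DFA compiled from each element's declared content model,
and the sketch assumes every element name has one):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingContentModelChecker extends DefaultHandler {

    // Placeholder for a deterministic automaton compiled from one element's
    // content model; step() returns -1 when the child is not allowed.
    interface ContentModel {
        int start();
        int step(int state, String childName);
        boolean accepting(int state);
    }

    private final Map<String, ContentModel> models;          // one model per element name
    private final Deque<int[]> states = new ArrayDeque<>();  // one automaton state per open element
    private final Deque<String> open = new ArrayDeque<>();   // names of the open elements

    StreamingContentModelChecker(Map<String, ContentModel> models) {
        this.models = models;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (!states.isEmpty()) {
            int[] parent = states.peek();                    // advance the parent's automaton
            parent[0] = models.get(open.peek()).step(parent[0], qName);
            if (parent[0] < 0) throw new SAXException("illegal child <" + qName + ">");
        }
        states.push(new int[] { models.get(qName).start() });
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        int[] frame = states.pop();
        open.pop();
        if (!models.get(qName).accepting(frame[0]))
            throw new SAXException("incomplete content in <" + qName + ">");
        // the popped frame is garbage now, so live memory stays O(depth)
    }
}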
> XPath more-or-less implies that you build a tree of some sort in memory,
> so it works against what you want.


What are your arguments? Is this only an intuition? A large subset of
XPath can be rewritten to avoid backward axes, and can be evaluated with a
top-down, left-to-right strategy compatible with stream evaluation.
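
As a toy illustration, a forward-only absolute path such as
/orders/order/item (the step names are placeholders) can be matched
against SAX events using nothing but the stack of open elements:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingPathMatcher extends DefaultHandler {
    private final List<String> steps = Arrays.asList("orders", "order", "item");
    private final Deque<String> path = new ArrayDeque<>(); // the only state: open elements

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.addLast(qName);
        // the element is selected when the whole path from the root equals the steps
        if (path.size() == steps.size() && steps.equals(new ArrayList<>(path))) {
            System.out.println("selected an /orders/order/item node");
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        path.removeLast();
    }
}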
Jul 20 '05 #3
In article <bp***********@nef.ens.fr>,
Alain Frisch <fr****@clipper.ens.fr> wrote:
% Patrick TJ McPhee, in message (comp.text.xml:58709), wrote:

% > XPath more-or-less implies that you build a tree of some sort in memory,
% > so it works against what you want.
%
% What are your arguments? Is this only an intuition?

Granted `more-or-less implies' sounds a bit like I'm stating a thesis, but
I'm not categorically saying that it has to be that way. I would not be
at all surprised to find that one could write an XPath implementation which
works on a stream. Presumably, you'd want to get all the expressions to
evaluate up front, so that you don't have to run over the stream once
for each query, but that's just housekeeping.

My statement is based a little on XPath's theoretical use of a tree as
its document representation, and more significantly on my ignorance of
any XPath implementation which works on anything but a DOM tree. Some
of the XML databases put the tree (of some sort) on disk rather than in
memory, but I don't know of an implementation that works
against a stream. The OP did not sound like someone who was interested
in writing an XPath implementation to solve his or her problems, so the
available implementations are an issue.

--

Patrick TJ McPhee
East York Canada
pt**@interlog.com
Jul 20 '05 #4
Nobody could answer your question without knowing what kind of "business
rules" you want to check.

But if the answer is you want to be able to check an arbitrary set of
XPath-based assertions, then there is only one well-known schema language
that can handle it - Schematron - and AFAIK all Schematron implementations
keep the entire document tree in memory (for starters). This doesn't
necessarily mean you couldn't handle a 100MB document, but that you should
expect to use a lot of memory to do it.

If you determine Schematron would meet your needs, I'd recommend trying some
test cases with, at least, an XSL-based implementation using Saxon as the
XSL processor. Saxon works very hard (in its tiny tree mode) to keep a
compact representation of the document. The other consideration is execution
time, which might also be considerable.
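
For example, once a Schematron schema has been compiled to a stylesheet, a
bare-bones JAXP sketch of running it with Saxon would look like this (the
file names are placeholders, and the factory class name is the one used by
recent Saxon releases; older versions use a different class):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SchematronViaXslt {
    public static void main(String[] args) throws Exception {
        // ask for Saxon explicitly rather than the platform default
        TransformerFactory factory = TransformerFactory.newInstance(
                "net.sf.saxon.TransformerFactoryImpl", null);
        Transformer validator = factory.newTransformer(
                new StreamSource("schematron-compiled.xsl"));
        // the compiled stylesheet writes the assertion failures to the output
        validator.transform(new StreamSource("big-document.xml"),
                            new StreamResult(System.out));
    }
}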

To answer your question from a theoretical perspective, several authors have
claimed that all XPath 1.0 expressions can be evaluated in a single
top-down, left-to-right pass through a document. A Google search will turn
them up. If this is true, and if the context of every XPath expression in a
schema can be determined statically, it should be possible to do your
"business rule" checking in a single pass. Temporary storage would still be
required to validate assertions that call for global or regional
document-knowledge, like IDs or identity constraints, but except for
pathological cases that should be relatively small compared to the entire
document. More importantly (memory is cheap), it should be fast.
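
For example, checking ID uniqueness in a single pass needs only a set of
the ID values seen so far, which grows with the number of IDs rather than
the size of the document (a sketch; the attribute name "id" is an
assumption):

import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class IdUniquenessHandler extends DefaultHandler {
    private final Set<String> seenIds = new HashSet<>(); // "regional" knowledge, kept small

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        String id = atts.getValue("id");
        if (id != null && !seenIds.add(id)) {
            throw new SAXException("duplicate id: " + id);
        }
    }
}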

AFAIK there aren't any of those bad puppies (single-pass implementations)
out there, but the expert on this subject is Rick Jelliffe, whom you can
reach through
http://www.ascc.net/xml/resource/sch...chematron.html.

Bob Foster
"Mike" <mm********@yahoo.com> wrote in message
news:IP********************@giganews.com...
Related to another topic I just posted, I wanted to discuss ways to optimize the validation of very large (>100MB) XML documents.

First, I have no idea if something like this already exists; it may even be the typical implementation for all I know.

At any rate, it occurs to me that the set of business rules that need to be validated against an XML document represent a limited set of nodes at any
given time (while parsing through the document). For example, if there is a parent<->child node dependency, then only the pertinent information related to those nodes needs to be kept in memory. Once the dependency has been
resolved (by validating the rule), the memory associated with those nodes
could then be freed. In this way, large documents could be validated
efficiently, by only storing information related to dependencies, and
immediately freeing memory once the dependency is resolved.

I don't have a lot of practical XML experience. But I've read, for example, that using a SAX parser can be difficult in cases where you need to maintain a lot of "state" information. So, what I'm asking is whether there is a
general solution to this problem, rather than having application-specific
code to handle the "state" of dependencies?

It seems to me that rule dependencies could be represented by a "graph",
similar in some ways to the Java garbage collector. And, like the garbage
collector, memory could be freed once there are no more "references" to a
particular dependency. The dependencies themselves would be something like "threads" that connect nodes. Larger threads would require more memory.
Further optimization might be achieved by determining if the dependency
threads are better suited for a depth-first or breadth-first traversal, or
some combination.

In my other post, I ask about whether XML Schema can be used for validation of rules like this, or if there are other solutions. In the context of this post, does XML Schema or any other method support any of the concepts I talk about above?

If XML Schema is unable to handle rules like this, and there is no other
available solution, does it make sense that something based on XPath might
work? I'm wondering if the XPath expressions could be used to represent the dependencies (as in what to keep in memory), and then something else would
actually evaluate the dependency.

Thanks for any help/suggestions/comments,

Mike

Jul 20 '05 #5
