Perl + SAX2 = slow?

Jesse Thompson

Greetings, fellow XML folk.

I've just gotten started making SAX filters in Perl. I was hoping to
build an XML templating engine this way, but the performance of
XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

This code:

use XML::SAX::Expat;
use XML::SAX::Writer;
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new(Output => '>-'))
    ->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. On a
500 MHz CPU, that's 10 kilobytes per gigahertz-second.

Is this in any way normal? I was hoping to be able to process XML
about a hundred times this fast, maybe 1 MB per gigahertz-second, or
about a thousand clock cycles per byte of consumed XML. I think that
sounds reasonable for the bare parsing/writing of XML in a zippy
language like Perl, so I have to assume I am doing something very
very wrong in my setup :(
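
The arithmetic behind these figures can be written out as a small Perl sketch. The function names are made up for illustration, and 1 KB is taken as 1000 bytes to match the round numbers in the post.

```perl
use strict;
use warnings;

# Throughput normalized by clock speed, in kilobytes per gigahertz-second.
sub kb_per_ghz_second {
    my ($kilobytes, $seconds, $ghz) = @_;
    return $kilobytes / ($seconds * $ghz);
}

# Clock cycles spent per byte at a given normalized throughput.
# Assumes decimal kilobytes (1 KB = 1000 bytes).
sub cycles_per_byte {
    my ($kb_per_ghz_s) = @_;
    return 1e9 / ($kb_per_ghz_s * 1000);
}

my $measured = kb_per_ghz_second(5, 1, 0.5);   # 5 KB in 1 s at 500 MHz
my $target   = 1000;                           # 1 MB per gigahertz-second

printf "measured: %g KB/GHz*s, %g cycles/byte\n",
    $measured, cycles_per_byte($measured);
printf "target:   %g KB/GHz*s, %g cycles/byte\n",
    $target, cycles_per_byte($target);
```

Normalizing by clock speed this way ignores memory speed, caches and I/O, so it is only a rough yardstick for comparing machines.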

So am I somehow getting the PurePerl parser instead of Expat? I'm
asking for Expat by name and die($parser) gives me
"XML::SAX::Expat=HASH(0x83b0090)".

Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit
in-memory, but at these rates it would take a dual
4 GHz Xeon server twenty minutes just to parse a 100 megabyte XML
document and write it back out to disk unaltered. What happens when
you want to plug in a pipeline of any merit? I'm not really sure how
fast DOM is, but big servers can have 3 gigabytes of RAM (100 MB * 30x
for DOM memory bloat), and I know my web browser reads XHTML and
builds DOM trees out of it at better than 10 KB per gigahertz-second.
So my results must, must be flawed somehow.

Does anyone know what could be going wrong, or how fast that code
snippet should be parsing XML? If I'm waiting around and then only
transforming nested XML nodes that match certain criteria (custom tag
names, attribute names, or maybe even just a custom namespace), sort
of like a templating engine (replace fake template data with data
pulled from a DB, or today's date, for instance), would that mean there
is an XML solution more efficient for my goals than SAX? Twig maybe,
or Essex? (I can't find much to read about Essex)

I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings ;) but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

Any insight will be appreciated.

- - Jesse Thompson
Lightsecond Technologies
http://www.lightsecond.com/

Jul 20 '05 #1

Jürgen Kahrs

Jesse Thompson wrote:

> Further, I simply cannot believe these results are typical. SAX was
> invented to handle multi-megabyte documents that DOM can't fit

Reading XML files of several GigaBytes length
with a SAX-implementation (Expat) took me some
minutes. I have tested this while integrating
Expat into GNU Awk.

> I was hoping to become an XML evangelist because I love everything
> about it (and even understand namespaces and encodings ;) but these
> results kind of made it feel like my bubble had burst. Everyone I know
> keeps molesting their XML projects with Regex, which just seems to me
> so much like towing an inoperative car around with a team of horses.

I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.

Jul 20 '05 #2

Jesse Thompson

Jürgen Kahrs <Ju*********************@vr-web.de> wrote in message news:<2q************@uni-berlin.de>...

> Reading XML files of several GigaBytes length
> with a SAX-implementation (Expat) took me some
> minutes. I have tested this while integrating
> Expat into GNU Awk.

Yeah, well, several gigabytes in several minutes (let's say 1 gigabyte
per minute) on a 1 GHz machine would be 16 MB/GHz*s. That is so
fast I wouldn't know what to do with myself: one thousand six hundred
times faster than my results are showing (10 KB/GHz*s). Break me off a
piece, would you? :)

> I cannot help you with Perl, but maybe xmlgawk
> can help you. This is GNU Awk extended (experimentally)
> with Expat.

As interesting as that sounds I don't know anything about Awk.. But if
it's fast in awk it must also be fast in Perl. I simply have to
believe my results are atypical for Perl::SAX.

- - Jesse

Jul 20 '05 #3

Jürgen Kahrs

Jesse Thompson wrote:

> As interesting as that sounds I don't know anything about Awk.. But if
> it's fast in awk it must also be fast in Perl. I simply have to
> believe my results are atypical for Perl::SAX.

I am a bit surprised to hear about something built upon
SAX that is slow. Maybe this comparison helps you:

http://xmlbench.sourceforge.net/resu...ark/index.html

Jul 20 '05 #4

Jesse Thompson

> I am a bit surprised to hear about something built upon
> SAX that is slow. Maybe this comparison helps you:
>
> http://xmlbench.sourceforge.net/resu...ark/index.html

Well, since you just quoted half the XML parsing benchmarks I've seen
on the internet, I'll go ahead and post the other half ;)
http://www.xml.com/pub/a/Benchmark/article.html

On that page you can notice something like a 30x speed difference
between C expat and Perl Expat. You're quoting a C/Java benchmark.
It's still not enough to explain my results though, Perl Expat (in
what looks like SAX1) is still shown to process 32 times faster than
my demonstration in SAX2. Perl::Expat vs. Perl::SAX::Expat *could*
explain that difference, but I don't buy it. :(

Just to recap, all I have is this:
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

That is the simplest possible command: "eat this file with expat and
write it back out again". There is no application built on top of it
yet, that's just the raw drivers turning my CPU into a radiator. :)

- - Jesse

Jul 20 '05 #5

Malcolm Dew-Jones

Jesse Thompson (je****@gmail.com) wrote:
: Greetings, fellow XML folk.

: I've just gotten started making SAX filters in Perl. I was hoping to
: build an XML templating engine this way, but the performance of
: XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

: This code:

: XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
: ))->parse_uri("test.xml");

: takes 1 second to parse a 5 kilobyte piece of XML on my machine. On a
: 500 MHz CPU, that's 10 kilobytes per gigahertz-second.

: Is this in any way normal?

Might I suggest you ask this on

comp.lang.perl.modules

Various people hang out there who might have some definitive
answers, or useful suggestions.

$0.02

Jul 20 '05 #6

Jürgen Kahrs

Jesse Thompson wrote:

> Well, since you just quoted half the XML parsing benchmarks I've seen
> on the internet, I'll go ahead and post the other half ;)
> http://www.xml.com/pub/a/Benchmark/article.html

Thanks for the link. Interesting.

Jul 20 '05 #7

Bjoern Hoehrmann

* Jesse Thompson wrote in comp.text.xml:

> This code:
>
> XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
> ))->parse_uri("test.xml");
>
> takes 1 second to parse a 5 kilobyte piece of XML on my machine. On a
> 500 MHz CPU, that's 10 kilobytes per gigahertz-second.

That does not say much. Did you eliminate all overhead from this test?
Like loading Perl, all the modules, initializing the parser, etc? If
not, that would explain a lot. Also note that a 5 KB document is a bad
test case if you want to know how it performs with several MBs of data.
Further note that XML::SAX::Expat is a bad choice if you desire good
performance, what you are doing here is

Expat -> XML::Parser::Expat -> XML::Parser -> XML::SAX::Expat

along with other modules like XML::SAX::Base in the chain which is quite
some overhead. A better choice would be to use XML::SAX::ExpatXS which
omits XML::Parser::Expat and XML::Parser and other parts from the chain,
and is thus much faster. Other modules like XML::LibXML::SAX might give
even better performance. Another problem with your test is that you
generate XML in the chain through XML::SAX::Writer and then through
STDOUT, both of which might significantly slow down processing speed.
In other words, there might be many reasons why this might show poor
performance.

Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
for a single run though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.
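
A measurement along these lines could be sketched with the core Benchmark module as follows. The file name and the helper `kb_per_second` are placeholders, the driver list is simply the one named in this post, and the loop assumes each driver accepts `parse_uri` with no handler attached. Modules are loaded before the timed region so that start-up cost is not counted.

```perl
use strict;
use warnings;
use Benchmark qw(:hireswallclock timeit);

# Hypothetical input path; point this at a real, reasonably large document.
my $file = 'test.xml';

# Drivers mentioned in the post; any that are not installed are skipped.
my @drivers = qw(XML::SAX::Expat XML::SAX::ExpatXS XML::LibXML::SAX);

# Convert a byte count and elapsed seconds into KB/s (1 KB = 1024 bytes).
sub kb_per_second {
    my ($bytes, $seconds) = @_;
    return unless $seconds > 0;
    return ($bytes / 1024) / $seconds;
}

if (-e $file) {
    my $bytes = -s $file;
    for my $class (@drivers) {
        (my $path = "$class.pm") =~ s{::}{/}g;
        next unless eval { require $path; 1 };   # load outside the timed region

        # Time only the parse itself, repeated a few times; no handler attached.
        my $runs = 5;
        my $t    = timeit($runs, sub { $class->new->parse_uri($file) });
        my $rate = kb_per_second($bytes * $runs, $t->real);
        printf "%-20s %s KB/s\n", $class,
            defined $rate ? sprintf('%.0f', $rate) : 'n/a';
    }
}
```

Repeating the parse and dividing the total bytes by the total wall-clock time smooths out some single-run noise, though a no-handler run still measures something different from a full pipeline with a writer attached.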

Jul 20 '05 #8

Jürgen Kahrs

Bjoern Hoehrmann wrote:

> Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
> as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
> That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
> XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
> Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
> for a single run though, so the results are not all that meaningful. For
> better results see `perldoc Benchmark`.

This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.

Jul 20 '05 #9

Bjoern Hoehrmann

* Jürgen Kahrs wrote in comp.text.xml:

> > Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
> > as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
> > That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
> > XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
> > Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
> > for a single run though, so the results are not all that meaningful. For
> > better results see `perldoc Benchmark`.
>
> This is interesting. For comparison: On a 1200 MHz AMD Duron,
> xmlgawk parses between 4000 and 5000 KB/s.

As I wrote, these results don't tell you much. If you have no handler
the processor might be optimized to ignore all data and just evaluate
the document for well-formedness; others might not as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon
to be released SGML/XML processor based on OpenSP is optimized like that
and for 100 iterations for the document generated via

`get http://www.w3.org/TR/REC-xml | tidy -utf8 -n --doctype omit`

with no handler versus with handlers for all events that don't do
anything,

          Rate OpenSP1 OpenSP2
OpenSP1 1.34/s      --    -87%
OpenSP2 10.4/s    677%      --

which just shows that creating Perl data structures is quite expensive.
XML::Parser has similar optimizations,

               Rate XML::Parser1 XML::Parser2
XML::Parser1 7.25/s           --         -81%
XML::Parser2 39.0/s         438%           --

and both compared

               Rate OpenSP2 XML::Parser2 OpenSP1 XML::Parser1
OpenSP2      1.35/s      --         -82%    -87%         -97%
XML::Parser2 7.34/s    445%           --    -31%         -81%
OpenSP1      10.6/s    685%          44%      --         -73%
XML::Parser1 39.0/s   2795%         431%    269%           --

XML::SAX::Expat is not optimized like that in any way and does many more
things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too, using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.) I get

             s/iter OpenSP1 XML::Parser1
OpenSP1        3.89      --         -82%
XML::Parser1  0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10MB/s using XML::Parser with throw-away processing versus the
about 2MB/s of SGML::Parser::OpenSP.
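
Tables like the ones above are what Benchmark's `cmpthese` prints. A minimal sketch of a no-handler versus no-op-handler comparison, assuming XML::Parser is installed (the block is skipped otherwise), might look like this. The synthetic document and the `make_doc` helper are invented for illustration and are not the inputs used for the numbers above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a small synthetic document; the element names are arbitrary.
sub make_doc {
    my ($items) = @_;
    my $body = join '', map { qq{<item id="$_">text $_</item>} } 1 .. $items;
    return "<root>$body</root>";
}

my $doc = make_doc(2000);

if (eval { require XML::Parser; 1 }) {
    # Handlers that receive every event but do nothing with it.
    my %noop = (Start => sub { }, End => sub { }, Char => sub { });

    # A negative count asks Benchmark to run each case for at least
    # that many CPU seconds.
    cmpthese(-1, {
        no_handlers   => sub { XML::Parser->new->parse($doc) },
        noop_handlers => sub { XML::Parser->new(Handlers => \%noop)->parse($doc) },
    });
}
```

The gap between the two rows gives a rough idea of what it costs just to cross from the C parser into Perl callbacks, before any real work is done in them.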

Jul 20 '05 #10

Jürgen Kahrs

Bjoern Hoehrmann wrote:

> As I wrote, these results don't tell you much. If you have no handler
> the processor might be optimized to ignore all data and just evaluate
> the document for well-formedness; others might not as it is uncommon to
> have no start_element handler, for example. SGML::Parser::OpenSP, a soon

Good argument, of course. But I really made sure that
all handlers were active. xmlgawk is stupid enough to
have all handlers active all the time. I also made sure
that the XML file was really parsed.

> things than what XML::Parser1 would do. And as I wrote, the input is
> highly relevant, too, using the 7,2MB example above (which is quite
> different in terms of markup/pcdata, etc.) I get
>
>              s/iter OpenSP1 XML::Parser1
> OpenSP1        3.89      --         -82%
> XML::Parser1  0.681    470%           --
>
> which is quite different from the 269% before. So here you'd get a rate
> of about 10MB/s using XML::Parser with throw-away processing versus the
> about 2MB/s of SGML::Parser::OpenSP.

Interesting. I just downloaded the file, and xmlgawk (based
on Expat) parses around 2 MB/s on a 550 MHz Pentium, which,
scaled by clock speed, is not much different from 4 MB/s on a 1200 MHz Duron.

Jul 20 '05 #11

William Park

Jürgen Kahrs <Ju*********************@vr-web.de> wrote:

> Bjoern Hoehrmann wrote:
> > As I wrote, these results don't tell you much. If you have no handler
> > the processor might be optimized to ignore all data and just evaluate
> > the document for well-formedness; others might not as it is uncommon to
> > have no start_element handler, for example. SGML::Parser::OpenSP, a soon
>
> Good argument, of course. But I really made sure that
> all handlers were active. xmlgawk is stupid enough to
> have all handlers active all the time. I also made sure
> that the XML file was really parsed.
>
> > things than what XML::Parser1 would do. And as I wrote, the input is
> > highly relevant, too, using the 7,2MB example above (which is quite
> > different in terms of markup/pcdata, etc.) I get
> >
> >              s/iter OpenSP1 XML::Parser1
> > OpenSP1        3.89      --         -82%
> > XML::Parser1  0.681    470%           --
> >
> > which is quite different from the 269% before. So here you'd get a rate
> > of about 10MB/s using XML::Parser with throw-away processing versus the
> > about 2MB/s of SGML::Parser::OpenSP.
>
> Interesting, I just downloaded the file and xmlgawk (based
> on Expat) parses around 2MB/s on a Pentium with 550 MHz;
> which is not much different from 4MB/s with a Duron 1200 MHz.

On my P3/800, the 7.5 MB file (W3C-Member-Validity.xml) takes
- 6 sec for Bash + Expat --> 1.2 MB/s
- 2.3 sec for Gawk + Expat --> 3.2 MB/s
which is in agreement with your data.

--
William Park <op**********@yahoo.ca>
Open Geometry Consulting, Toronto, Canada

Jul 20 '05 #12

Jesse Thompson

Thank you Jürgen and Bjoern, that has all been very enlightening :)

So the way Bjoern puts it, Expat itself is quite fast, and thanks to
some streamlining ExpatXS is faster still (it bumped me up to
33 KB/GHz*s; libXML appears incompatible with my Glib2.2 system at the
moment), but XML::SAX::Writer is very, very slow.

The reason I'm keeping that in is because it /is/ a constant in my
operations.. in that I'm trying to set up a mechanism where I read an
XML file, I have a flexible chain of filters that transform the file
in the SAX stream, and then it gets written back out again. I don't
know if there's an obvious way to factor out the need for Writer in a
scenario like this. Passing an event to a C-module has to be faster
than writing my own writing routines in Perl to avoid the SAX faucet.

But if Writer is the slow part I guess I'll want to research a faster
Writer? Also I have to wonder about SAX in general for my goals. I
don't think SAX supports any prefilters. My filters will be looking
for XML tags with certain names or attributes before they begin doing
their jobs, so if there was a prefilter, like my filter said "skip me
unless tagname =~ /^abc/ (or) tagname eq 'abc' (or)
defined(attribute->{'{mynamespace}process'})" that might make things
much quicker.
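
Within plain Perl SAX2, the closest thing to such a prefilter is a cheap test at the top of `start_element`. The sketch below assumes the usual Perl SAX2 element hash, with `Attributes` keyed as `{NamespaceURI}LocalName`. The namespace URI, the `TemplateFilter` class and the `wants_element` helper are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical namespace URI, for illustration only.
my $NS = 'http://example.com/template';

# Decide whether an element is one the filter should act on.
# $el is a Perl SAX2 element hash with Name and Attributes entries.
sub wants_element {
    my ($el) = @_;
    return 1 if $el->{Name} =~ /^abc/;
    return 1 if exists $el->{Attributes}{ '{' . $NS . '}process' };
    return 0;
}

package TemplateFilter;
# -norequire keeps this sketch loadable where XML::SAX is not installed;
# a real filter would simply load XML::SAX::Base.
use parent -norequire, 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    if (main::wants_element($el)) {
        # Template substitution would go here.
    }
    # Forward the event unchanged so the rest of the chain still sees it.
    return $self->SUPER::start_element($el);
}

package main;
```

With XML::SAX::Base loaded, such a filter would sit between the parser and the writer by passing the writer as `Handler` to `TemplateFilter->new` and the filter as `Handler` to the parser. Every event still crosses into Perl, so this keeps the filter's own work small but does not remove the per-event overhead discussed earlier in the thread.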

Otherwise, since nearly all of my files will be less than 200k, maybe
I should start looking at DOM.

Jürgen: your xmlgawk project sounds very very interesting. I took a
look at some Gawk tutorials and it looks like a capable tool for very
many applications, more so with XML support. Is there anywhere I could
snag it from? Google seems only to know of some of your discussions
with William over the project :)

Thank you again!

- - Jesse

Jul 20 '05 #13

Jürgen Kahrs

Jesse Thompson wrote:

> Jürgen: your xmlgawk project sounds very very interesting. I took a
> look at some Gawk tutorials and it looks like a capable tool for very
> many applications, more so with XML support. Is there anywhere I could
> snag it from? Google seems only to know of some of your discussions
> with William over the project :)

There is very little documentation on xmlgawk currently.
A collection of pointers can be found in my
posting to comp.lang.awk on 2004-08-13:

http://groups.google.de/groups?hl=de...Dcomp.lang.awk

Jul 20 '05 #14
