Perl + SAX2 = slow?

Greetings fell XML folk.

I've just gotten started making SAX filters in Perl. I was hoping to
build an XML templating engine this way, but the performance of
XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

This code:

use XML::SAX::Expat;
use XML::SAX::Writer;
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new(Output => '>-'))
    ->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. At
500 MHz, that's 10 kilobytes per gigahertz-second.

Is this in any way normal? I was hoping to be able to process XML
about a hundred times this fast, maybe 1 MB per gigahertz-second, or
about a thousand clock cycles per byte of consumed XML. I think that
sounds reasonable for the bare parsing/writing of XML in a zippy
language like Perl, so I have to assume I am doing something very,
very wrong in my setup :(
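
The clock-normalized figures above are easy to misread, so here is a small sketch of the arithmetic. The helper names `kb_per_ghz_second` and `cycles_per_byte` are made up for illustration and belong to no module, the inputs are the numbers quoted in the post, and 1 KB is assumed to mean 1024 bytes.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of any module: they only restate the
# arithmetic from the post.

# Throughput normalized by clock speed, in KB per GHz-second.
sub kb_per_ghz_second {
    my ($kilobytes, $seconds, $ghz) = @_;
    return $kilobytes / ($seconds * $ghz);
}

# Clock cycles spent per input byte at a given normalized throughput.
# One GHz-second is 1e9 cycles; 1 KB is assumed to be 1024 bytes.
sub cycles_per_byte {
    my ($kb_per_ghz_s) = @_;
    return 1e9 / ($kb_per_ghz_s * 1024);
}

my $measured = kb_per_ghz_second(5, 1, 0.5);   # 5 KB in 1 s at 500 MHz
my $hoped    = 1024;                           # 1 MB per GHz-second

printf "measured: %.0f KB/GHz*s, about %.0f cycles per byte\n",
    $measured, cycles_per_byte($measured);
printf "hoped:    %.0f KB/GHz*s, about %.0f cycles per byte\n",
    $hoped, cycles_per_byte($hoped);
```

Under these assumptions the measured rate works out to roughly a hundred thousand cycles per byte, against roughly a thousand for the hoped-for rate.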

So am I somehow getting the PurePerl parser instead of Expat? I'm
asking for Expat by name, and die($parser) gives me
"XML::SAX::Expat=HASH(0x83b0090)".

Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit
in memory, but at these rates it would take a dual
4 GHz Xeon server twenty minutes just to parse a 100 megabyte XML
document and write it back out to disk unaltered. What happens when
you want to plug in a pipeline of any merit? I'm not really sure how
fast DOM is, but big servers can have 3 gigabytes of RAM (100 MB * 30x
for DOM memory bloat), and I know my web browser reads XHTML and
builds DOM trees out of it at better than 10 KB per gigahertz-second.
So my results must, must be flawed somehow.

Does anyone know what could be going wrong, or how fast that code
snippet should be parsing XML? If I'm waiting around and then only
transforming nested XML nodes that match certain criteria (custom tag
names, attribute names, or maybe even just a custom namespace), sort
of like a templating engine (replace fake template data with data
pulled from a DB, or today's date for instance), would that mean there
is an XML solution more efficient for my goals than SAX? Twig maybe,
or Essex? (I can't find much to read about Essex.)

I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings ;) but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

Any insight will be appreciated.

- - Jesse Thompson
Lightsecond Technologies
http://www.lightsecond.com/
Jul 20 '05 #1
Jesse Thompson wrote:
> Further, I simply cannot believe these results are typical. SAX was
> invented to handle multi-megabyte documents that DOM can't fit

Reading XML files of several gigabytes in length
with a SAX implementation (Expat) took me some
minutes. I have tested this while integrating
Expat into GNU Awk.

> I was hoping to become an XML evangelist because I love everything
> about it (and even understand namespaces and encodings ;) but these
> results kind of made it feel like my bubble had burst. Everyone I know
> keeps molesting their XML projects with Regex, which just seems to me
> so much like towing an inoperative car around with a team of horses.

I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.
Jul 20 '05 #2
Jürgen Kahrs <Ju*********************@vr-web.de> wrote in message news:<2q************@uni-berlin.de>...
> Reading XML files of several gigabytes in length
> with a SAX implementation (Expat) took me some
> minutes. I have tested this while integrating
> Expat into GNU Awk.

Yeah, well, several gigabytes in several minutes (let's say 1 gigabyte
per minute) on a 1 GHz machine would be 16 MB/GHz*s. That is so
fast I wouldn't know what to do with myself: one thousand six hundred
times faster than my results are showing (10 KB/GHz*s). Break me off a
piece, would you? :)

> I cannot help you with Perl, but maybe xmlgawk
> can help you. This is GNU Awk extended (experimentally)
> with Expat.

As interesting as that sounds, I don't know anything about Awk. But if
it's fast in awk it must also be fast in Perl. I simply have to
believe my results are atypical for Perl::SAX.

- - Jesse
Jul 20 '05 #3
Jesse Thompson wrote:
> As interesting as that sounds I don't know anything about Awk.. But if
> it's fast in awk it must also be fast in Perl. I simply have to
> believe my results are atypical for Perl::SAX.

I am a bit surprised to hear about something built upon
SAX that is slow. Maybe this comparison helps you:

http://xmlbench.sourceforge.net/resu...ark/index.html
Jul 20 '05 #4
> I am a bit surprised to hear about something built upon
> SAX that is slow. Maybe this comparison helps you:
>
> http://xmlbench.sourceforge.net/resu...ark/index.html

Well, since you just quoted half the XML parsing benchmarks I've seen
on the internet, I'll go ahead and post the other half ;)
http://www.xml.com/pub/a/Benchmark/article.html

On that page you can see something like a 30x speed difference
between C expat and Perl Expat. You're quoting a C/Java benchmark.
It's still not enough to explain my results, though: Perl Expat (in
what looks like SAX1) is still shown to process 32 times faster than
my demonstration in SAX2. Perl::Expat vs. Perl::SAX::Expat *could*
explain that difference, but I don't buy it. :(

Just to recap, all I have is this:
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

That is the simplest possible command: "eat this file with expat and
write it back out again". There is no application built on top of it
yet; that's just the raw drivers turning my CPU into a radiator. :)

- - Jesse
Jul 20 '05 #5
Jesse Thompson (je****@gmail.com) wrote:
: Greetings fell XML folk.

: I've just gotten started making SAX filters in Perl. I was hoping to
: build an XML templating engine this way, but the performance of
: XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

: This code:

: XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
: ))->parse_uri("test.xml");

: takes 1 second to parse a 5 kilobyte piece of XML on my machine. At
: 500 MHz, that's 10 kilobytes per gigahertz-second.

: Is this in any way normal?

Might I suggest you ask this on

comp.lang.perl.modules

Various people hang out there who might have some definitive
answers, or useful suggestions.

$0.02
Jul 20 '05 #6
Jesse Thompson wrote:
> Well, since you just quoted half the XML parsing benchmarks I've seen
> on the internet, I'll go ahead and post the other half ;)
> http://www.xml.com/pub/a/Benchmark/article.html

Thanks for the link. Interesting.
Jul 20 '05 #7
* Jesse Thompson wrote in comp.text.xml:
> This code:
>
> XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
> ))->parse_uri("test.xml");
>
> takes 1 second to parse a 5 kilobyte piece of XML on my machine. At
> 500 MHz, that's 10 kilobytes per gigahertz-second.

That does not say much. Did you eliminate all overhead from this test,
such as loading Perl and all the modules, initializing the parser, etc.? If
not, that would explain a lot. Also note that a 5 KB document is a bad
test case if you want to know how it performs with several MB of data.
Further note that XML::SAX::Expat is a bad choice if you want good
performance. What you are doing here is

Expat -> XML::Parser::Expat -> XML::Parser -> XML::SAX::Expat

along with other modules like XML::SAX::Base in the chain, which is quite
some overhead. A better choice would be XML::SAX::ExpatXS, which
omits XML::Parser::Expat, XML::Parser, and other parts from the chain
and is thus much faster. Other modules like XML::LibXML::SAX might give
even better performance. Another problem with your test is that you
generate XML in the chain through XML::SAX::Writer and then write it to
STDOUT, both of which might significantly slow down processing.
In other words, there are many possible reasons for the poor
performance.

Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
as input, your script takes 96 seconds on a 1066 MHz Mobile Celeron, FWIW.
That's about 77 KB per second. With no handler: 40 seconds, 184 KB/s. With
XML::LibXML::SAX and no handler: 18 seconds, 409 KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler: 9 seconds, 819 KB/s. Each
is a single run, though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.
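
One way to keep interpreter start-up and module loading out of a measurement like this is to start the clock only around the parse call. The following is a minimal sketch under several assumptions: `time_once` and `kb_per_second` are hypothetical helpers, the input is a synthetic in-memory document with invented element names rather than the file used above, and 1 KB is taken as 1024 bytes. It times XML::SAX::Expat with no handler and skips the measurement if that module is not installed.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run $code once and return elapsed wall-clock seconds.
sub time_once {
    my ($code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    return tv_interval($t0);
}

# Hypothetical helper: throughput in KB/s (1 KB assumed to be 1024 bytes).
sub kb_per_second {
    my ($bytes, $seconds) = @_;
    return $bytes / 1024 / $seconds;
}

# Synthetic stand-in for a real test file; the element names are invented.
my $xml = '<root>' . ('<item id="1">some text</item>' x 20_000) . '</root>';

# Load the parser before the clock starts so that module loading is not
# counted. Skip the measurement if XML::SAX::Expat is not installed.
if (eval { require XML::SAX::Expat; 1 }) {
    my $parser  = XML::SAX::Expat->new;   # no Handler: events are discarded
    my $seconds = time_once(sub { $parser->parse_string($xml) });
    printf "%d bytes in %.3f s = %.0f KB/s\n",
        length($xml), $seconds, kb_per_second(length($xml), $seconds);
}
else {
    print "XML::SAX::Expat not installed; nothing timed\n";
}
```

This is still a single run, so it carries the same caveat as the figures above; it only removes the start-up cost from the number.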
Jul 20 '05 #8
Bjoern Hoehrmann wrote:
> Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
> as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
> That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
> XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
> Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
> for a single run though, so the results are not all that meaningful. For
> better results see `perldoc Benchmark`.

This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.
Jul 20 '05 #9
* Jürgen Kahrs wrote in comp.text.xml:
>> Using http://lists.w3.org/Archives/Public/...4Mar/0169.html
>> as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
>> That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
>> XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
>> Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
>> for a single run though, so the results are not all that meaningful. For
>> better results see `perldoc Benchmark`.
>
> This is interesting. For comparison: On a 1200 MHz AMD Duron,
> xmlgawk parses between 4000 and 5000 KB/s.

As I wrote, these results don't tell you much. If you have no handler,
the processor might be optimized to ignore all data and just check
the document for well-formedness; others might not be, as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon
to be released SGML/XML processor based on OpenSP, is optimized like that.
For 100 iterations over the document generated via

`get http://www.w3.org/TR/REC-xml | tidy -utf8 -n --doctype omit`

with no handler versus with handlers for all events that don't do
anything, I get

              Rate OpenSP2 OpenSP1
OpenSP2     1.34/s      --    -87%
OpenSP1     10.4/s    677%      --

which just shows that creating Perl data structures is quite expensive.
XML::Parser has similar optimizations,

                Rate XML::Parser2 XML::Parser1
XML::Parser2  7.25/s           --         -81%
XML::Parser1  39.0/s         438%           --

and both compared

                Rate OpenSP2 XML::Parser2 OpenSP1 XML::Parser1
OpenSP2       1.35/s      --         -82%    -87%         -97%
XML::Parser2  7.34/s    445%           --    -31%         -81%
OpenSP1       10.6/s    685%          44%      --         -73%
XML::Parser1  39.0/s   2795%         431%    269%           --

XML::SAX::Expat is not optimized like that in any way and does many more
things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too. Using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.), I get

              s/iter OpenSP1 XML::Parser1
OpenSP1         3.89      --         -82%
XML::Parser1   0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10 MB/s using XML::Parser with throw-away processing versus
about 2 MB/s with SGML::Parser::OpenSP.
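
A comparison of this kind can be sketched with the core Benchmark module. This is only an illustration under stated assumptions, not the script that produced the tables above: the input is a small synthetic document built by a hypothetical `make_doc` helper, only the Start, End, and Char events get do-nothing handlers (rather than every event), and the labels `no_handlers` and `noop_handlers` are invented. It is skipped if XML::Parser is not installed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical helper: build a synthetic document. The element names and
# sizes are invented for illustration.
sub make_doc {
    my ($items) = @_;
    return '<root>' . ('<item id="1">some text</item>' x $items) . '</root>';
}

if (eval { require XML::Parser; 1 }) {
    my $xml = make_doc(5_000);

    # Case 1: no handlers at all, so the parser can throw everything away.
    my $bare = XML::Parser->new;

    # Case 2: handlers that do nothing, which still forces a Perl-level
    # call with an argument list for every start tag, end tag, and text run.
    my $noop = XML::Parser->new(
        Handlers => {
            Start => sub { },
            End   => sub { },
            Char  => sub { },
        },
    );

    # A negative count asks Benchmark to run each case for at least that
    # many CPU seconds instead of a fixed number of iterations.
    cmpthese(-1, {
        no_handlers   => sub { $bare->parse($xml) },
        noop_handlers => sub { $noop->parse($xml) },
    });
}
else {
    print "XML::Parser not installed; nothing benchmarked\n";
}
```

The output has the same shape as the tables above, with one row per case and the relative differences as percentages; the actual numbers depend entirely on the machine and the input.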
Jul 20 '05 #10
