
XML::Simple in perl?

Dan
#1: Jul 20 '05
Using XML::Simple in Perl is
extremely slow to parse big
XML files (can be up to 250M,
taking ~1h).

How can I increase my performance /
reduce my memory usage?

Is SAX the way forward?

Should I consider using (learning)
Expat.c for increased performance?

How long would parsing a 250M XML
file take with Expat?

Thanks for any suggestions you can give,
Dan.


Janek Schleicher
#2: Jul 20 '05

re: XML::Simple in perl?


yDan wrote at Thu, 31 Jul 2003 12:41:24 +0100:
> Using XML::Simple in Perl is
> extremely slow to parse big
> XML files (can be up to 250M,
> taking ~1h).

XML::Simple is not the only module on CPAN.
There are also
XML::Smart
XML::Parser
XML::LibXML
....

> How can I increase my performance /
> reduce my memory usage?
>
> Is SAX the way forward?

Note that SAX is basically an event API: the parser calls your handlers for
start tags, end tags and similar. That's different from most of the previously
mentioned modules, as they also build a huge in-memory tree of the XML document.

Because SAX never builds that tree, it can reduce memory use and save time.
In Perl you can use e.g. XML::SAX for it.
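
To make the event idea concrete, here is a minimal sketch, assuming XML::SAX is installed with at least one working parser. The ElementCounter class and the sample XML are made up for illustration; only start_element events are handled and nothing is kept in memory except the counts.

```perl
use strict;
use warnings;

# Hypothetical handler class: counts elements by name as events stream past.
package ElementCounter;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $element) = @_;
    # $element is a hash reference; Name holds the element name.
    $self->{count}{ $element->{Name} }++;
}

package main;
use XML::SAX::ParserFactory;

# Made-up sample document, small enough to read at a glance.
my $xml = '<items><item>a</item><item>b</item></items>';

my $handler = ElementCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);    # parse_uri($filename) reads a file instead

my $count = $handler->{count} || {};
print "$_\t$count->{$_}\n" for sort keys %$count;
```

For a 250M file you would call parse_uri on the file name, and the handler would write its output as it goes rather than collecting anything.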

> Should I consider using (learning)
> Expat.c for increased performance?

XML::Parser and XML::SAX::Expat are e.g. based on Expat, so you might use
the Perl interfaces if you want.
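
As a rough sketch of the Perl interface to Expat, assuming XML::Parser is installed: the element name gene and the attribute id are invented for the example, and only the Start handler is registered.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up sample document; "gene" and "id" are hypothetical names.
my $xml = '<genes><gene id="g1"/><gene id="g2"/></genes>';

my @ids;    # filled in by the Start handler

my $parser = XML::Parser->new(
    Handlers => {
        # Start receives the Expat object, the element name,
        # then the attributes as a flat name/value list.
        Start => sub {
            my ($expat, $name, %attr) = @_;
            push @ids, $attr{id} if $name eq 'gene';
        },
    },
);

$parser->parse($xml);    # parsefile($filename) reads a file instead
print join("\t", @ids), "\n";
```

The handlers run while the file is being read, so memory use stays roughly flat no matter how large the input is.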



Greetings,
Janek
Dan
#3: Jul 20 '05

re: XML::Simple in perl?




Janek Schleicher wrote:
> yDan wrote at Thu, 31 Jul 2003 12:41:24 +0100:
>
>>Using XML::Simple in Perl is
>>extremely slow to parse big
>>XML files (can be up to 250M,
>>taking ~1h).
>
>
> XML::Simple is not the only module on CPAN.
> There are also
> XML::Smart
> XML::Parser
> XML::LibXML
> ...

Which ones have the same 'front end' as
XML::Simple? I would rather not change
code if I don't have to.

Which one uses least memory on big files?


>
>>How can I increase my performance /
>>reduce my memory usage?
>>
>>Is SAX the way forward?
>
>
> Note that SAX is basically an API for XML triggering starting and ending
> tags and similar. That's different to the previously mentioned modules, as
> they also generate a huge tree view to the XML document.
>
> But SAX can reduce the memory wastage and gain in time advantage.
> In Perl you can use e.g. XML::SAX for it.
>
>>Should I consider using (learning)
>>Expat.c for increased performance?
>
>
> XML::Parser and XML::SAX::Expat are e.g. based on Expat, so you might use
> the Perl interfaces if you want.

I think I will end up doing this...

Is it any slower?

My requirements are extremely simple, I just need
to print tab delimited lines to a file, so I was
thinking I could take this opportunity to try
to learn C....

Cheers,
Dan.


>
>
>
> Greetings,
> Janek

Tad McClellan
#4: Jul 20 '05

re: XML::Simple in perl?


Dan <dmb@mrc-dunn.cam.ac.uk> wrote:
> Janek Schleicher wrote:
>> yDan wrote at Thu, 31 Jul 2003 12:41:24 +0100:

>>>extremely slow to parse big

>> There are also
>> XML::Smart
>> XML::Parser
>> XML::LibXML
>> ...
>
> Which ones have the same 'front end' as
> XML::Simple?


None of them.

The "simple" front end is why the module is called Simple.

> I would rather not change
> code if I don't have to.


Too late. :-(

> Which one uses least memory on big files?


None of those ones.

>>>How can I increase my performance /
>>>reduce my memory usage?
>>>
>>>Is SAX the way forward?

>> But SAX can reduce the memory wastage and gain in time advantage.


That one.



There is also a mailing list specifically for doing XML processing
using Perl:

http://listserv.ActiveState.com/mail...tinfo/perl-xml


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
cp
#5: Jul 20 '05

re: XML::Simple in perl?


In article <3F290064.6070003@mrc-dunn.cam.ac.uk>, Dan
<dmb@mrc-dunn.cam.ac.uk> wrote:

> Using XML::Simple in Perl is
> extremely slow to parse big
> XML files (can be up to 250M,
> taking ~1h).

XML::Simple is limited. The author points it out in the docs. It was
designed for a fairly specific purpose, parsing small configuration
files written in XML. It has since been expanded on, but its core
remains, well, simple.

> How can I increase my performance /
> reduce my memory usage?
>
> Is SAX the way forward?

The docs for XML::Simple Version 2.05 suggest that you can (as of
version 1.08) use a SAX parser with XML::Simple, the benefits of which
are:

Applications written to the SAX API can extract data
from huge XML documents without the memory overheads
of a DOM or tree API.

So you might read the docs, use a SAX parser, and see some speed
benefit. You might not.

--
cp
Dan
#6: Jul 20 '05

re: XML::Simple in perl?


I am now using XML::Parser,
which is working nicely, apart
from the occasional weird behaviour:
in some cases characters go missing
(i.e. I get a 1 instead of 19).

The error is persistent, i.e. not a random
character, but the same character each time
goes missing.

It is really confusing.

Also my Char event gets called 3 times
per tag, even though there are no new
lines anywhere in the tag text.

Little frustrating problems....

Dan.


Michel Rodriguez
#7: Jul 20 '05

re: XML::Simple in perl?


Tad McClellan wrote:
> Dan <dmb@mrc-dunn.cam.ac.uk> wrote:
>> Janek Schleicher wrote:
>>> yDan wrote at Thu, 31 Jul 2003 12:41:24 +0100:
>
>>>> [XML::Simple is] extremely slow to parse big [files]
>
>>> There are also
>>> XML::Smart
>>> XML::Parser
>>> XML::LibXML
>>> ...
>>
>> Which ones have the same 'front end' as XML::Simple?
>
> None of them.
>
> The "simple" front end is why the module is called Simple.

As a matter of fact XML::Smart has the same 'front end', and XML::Twig has a
method named 'simplify', which generates the same data structure as
XML::Simple, on a tree or on a sub-tree.

One possible cause for the problem might be that XML::Simple could be using
XML::SAX::PurePerl as its parser, which is very slow. This depends on your
installation. If you have XML::LibXML or XML::Parser installed you can set
the $XML::Simple::PREFERRED_PARSER variable to tell it to use another
parser, see the docs.
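
A minimal sketch of that setting, assuming both XML::Simple and XML::Parser are installed. The config document is made up; the rest of an existing XML::Simple program would stay as it is.

```perl
use strict;
use warnings;
use XML::Simple;

# Ask XML::Simple to parse with the Expat-based XML::Parser
# instead of whichever SAX parser the installation defaults to.
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

# Made-up sample document; XMLin also accepts a file name.
my $data = XMLin('<config><name>demo</name><size>3</size></config>');

print "$data->{name}\t$data->{size}\n";
```

This only swaps the parser underneath. XML::Simple still builds the whole data structure in memory, so it may help with speed more than with memory use.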

>> Which one uses least memory on big files?

You might want to have a look at XML::Twig, which is specially designed for
big files (but I might be slightly biased ;--)
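
A rough sketch of the usual XML::Twig pattern for big files, assuming XML::Twig is installed. The element names rec, id and seq are invented for the example; the handler runs once per rec element and then frees what has been parsed so far.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample document; rec, id and seq are hypothetical names.
my $xml = '<db><rec><id>1</id><seq>ACGT</seq></rec>'
        . '<rec><id>2</id><seq>TTGA</seq></rec></db>';

my @lines;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <rec> element has been parsed.
        rec => sub {
            my ($t, $rec) = @_;
            push @lines, join "\t",
                $rec->first_child_text('id'),
                $rec->first_child_text('seq');
            $t->purge;    # release the part of the tree already handled
        },
    },
);

$twig->parse($xml);    # parsefile($filename) reads a file instead
print "$_\n" for @lines;
```

With a real 250M file the handler would print each line straight to the output file instead of pushing onto @lines, so only one record is held at a time.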

__
Michel Rodriguez
Perl & XML
http://xmltwig.com
Tad McClellan
#8: Jul 20 '05

re: XML::Simple in perl?


Dan <dmb@mrc-dunn.cam.ac.uk> wrote:
> I am now using XML::Parser,
> which is working nicely, apart
> from the occasional weird behaviour,
> in some cases characters go missing,

> The error is persistent, i.e. not a random
> character, but the same character each time
> goes missing.


If you show us a short and complete program that we can run that
illustrates your problem, then we can surely help you solve
your problem.

> It is really confusing.


Can't help with unseen code...

> Also my Char event gets called 3 times
> per tag,


That is "normal".

You'll get the PCDATA in dribs and drabs, so you need to keep collecting
it until you reach the end of the containing element.
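
Something along these lines is what I mean by collecting, as a sketch only: the element names and the %data / $current variables are made up, and it assumes every value you want sits in a leaf element.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up sample document; the entity makes split Char calls likely.
my $xml = '<row><a>19</a><b>x &amp; y</b></row>';

my %data;       # collected text, keyed by element name
my $current;    # name of the element currently being read, if any

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name) = @_;
            $current = $name;
            $data{$name} = '';
        },
        # Char may be called several times for one piece of text,
        # so append each chunk rather than assigning it.
        Char => sub {
            my ($expat, $text) = @_;
            $data{$current} .= $text if defined $current;
        },
        # Forget the element name once its end tag has been seen.
        End => sub { undef $current },
    },
);

$parser->parse($xml);
print join("\t", $data{a}, $data{b}), "\n";
```

How the text gets split between Char calls is up to the parser, so the only safe assumption is that you have the full text when the End handler fires.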

> even though there are no new
> lines anywhere in the tag text.


The concept of "lines" is not present in XML.

Remove every newline from an XML document, and it is still
an XML document.



[snip upside-down quoting]

--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas
Dan
#9: Jul 20 '05

re: XML::Simple in perl?


Ta, the problem is fixed now,
I forgot to unset my global $currentTag
in the &endTag event handler, leading to
the 'dribs and drabs' below, which actually
belonged to outer tags (I was mistakenly
giving them to $currentTag).

With this bug gone I can now safely

$data{$currentTag} .= $text;

where I had been using

$data{$currentTag} = $text if !$data{$currentTag};

Hence my occasional missing characters.
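
For anyone hitting the same thing, a small simulation of the difference, with no parser involved. It assumes the text "19" arrives as two Char calls, "1" then "9"; the hash names are made up.

```perl
use strict;
use warnings;

# Assumed split: one text node delivered in two Char calls.
my @chunks = ('1', '9');

my (%guarded, %appended);
for my $text (@chunks) {
    # Old code: assign only while the slot is still empty,
    # so every chunk after the first is silently dropped.
    $guarded{value} = $text if !$guarded{value};

    # Fixed code: append every chunk as it arrives.
    $appended{value} .= $text;
}

print "guarded:  $guarded{value}\n";
print "appended: $appended{value}\n";
```

The guarded version keeps only the first chunk, which is how 19 turned into 1.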

Thanks very much for all the kind help
and advice,

Regards,

Dan.

DIY GENOME...
perl -e '@A=qw(A T C G); for(1..10**6){print $A[rand(@A)]}' > \
myGenome.txt



