lisp is winner in DOM parsing contest! 8-]

Alex Mizrahi

Hello, All!

i have 3mb long XML document with about 150000 lines (i think it has about
200000 elements there) which i want to parse to DOM to work with.
first i thought there will be no problems, but there were..
first i tried Python.. there's special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
will be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
it - i don't want such processing.. i think this is because of that fat
class implementation - possibly Python had some significant overhead for each
object instance, or something like this..
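
A rough way to check the per-object overhead guessed at here, as a sketch only: it builds a synthetic document (the `make_doc` helper and its element names are made up for illustration and are not the 3 MB file discussed in this thread) and uses the standard `tracemalloc` module to compare the size of the XML text with the memory allocated while `xml.dom.minidom` parses it. The ratio will vary with the Python version and the shape of the document.

```python
import tracemalloc
from xml.dom import minidom


def make_doc(n_records):
    """Build a synthetic XML string (hypothetical stand-in for a real file)."""
    rows = ['<rec id="%d"><name>item%d</name><val>%d</val></rec>' % (i, i, i)
            for i in range(n_records)]
    return "<root>" + "".join(rows) + "</root>"


def dom_overhead(xml_text):
    """Return (bytes of XML text, bytes allocated while building the DOM)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    dom = minidom.parseString(xml_text)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    dom.unlink()  # break reference cycles so the tree can be freed promptly
    return len(xml_text.encode("utf-8")), after - before


if __name__ == "__main__":
    size, allocated = dom_overhead(make_doc(2000))
    print("xml bytes:", size, "dom bytes:", allocated,
          "ratio: %.1f" % (allocated / size))
```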

then i asdf-installed s-xml package and tried it with it. it ate only 25
megs for lxml representation. i think interning element names helped a lot..
it was CLISP that has unicode inside, so i think it could be even less
without unicode..
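
What interning buys can be sketched in Python, purely as an illustration. This is not how s-xml works internally; the `parse_to_lists` helper and its nested-list layout are assumptions made for the example. Each tag name is passed through `sys.intern`, so every element with the same name refers to one shared string object instead of carrying its own copy.

```python
import sys
import xml.parsers.expat


def parse_to_lists(xml_text):
    """Parse XML into nested lists: [tag, attrs_dict, child, child, ...]."""
    root = [None, {}]          # dummy holder for the top-level element
    stack = [root]

    def start(name, attrs):
        node = [sys.intern(name), attrs]   # one shared string per distinct tag
        stack[-1].append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def text(data):
        # expat may deliver text in several chunks; each chunk is kept as-is
        if data.strip():
            stack[-1].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)
    return root[2]


if __name__ == "__main__":
    tree = parse_to_lists('<a><b x="1">hi</b><b>yo</b></a>')
    print(tree)
    print(tree[2][0] is tree[3][0])   # both <b> elements share one name string
```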

then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks like
interning helps a lot 8-]

then i tried Perl XML::DOM.. it was better than python - about 180megs, but
it was slowest.. at least it consumed mem slower than python 8-]

and java.. with default parser it took 45mbs.. maybe it interned strings,
but there was overhead from classes - storing trees is definitely what's
lisp optimized for 8-]

so lisp is winner.. but it has no standard way (not even a non-standard but
simple one) to write binary IEEE floating point representation, so common
lisp sucks and i will use c++ for my task.. 8-]]]

With best regards, Alex 'killer_storm' Mizrahi.

Jul 18 '05 #1


Hans Nowak

Alex Mizrahi wrote:

> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were..
> first i tried Python.. there's special interest group that wants to "make
> Python become the premier language for XML processing" so i thought there
> will be no problems with this document..
> i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
> it - i don't want such processing.. i think this is because of that fat
> class implementation - possibly Python had some significant overhead for each
> object instance, or something like this..

Have you tried ElementTree?

http://effbot.org/zone/element-index.htm
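
For reference, a minimal sketch of the ElementTree route. It uses the `xml.etree.ElementTree` module that later became part of the standard library; at the time of this thread ElementTree was a separate download from the link above. The `count_tags` helper and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Parse a whole document into an ElementTree and count elements per tag."""
    tree = ET.parse(source)            # accepts a filename or a file object
    return Counter(elem.tag for elem in tree.iter())


if __name__ == "__main__":
    sample = io.BytesIO(b"<root><rec><val>1</val></rec><rec/></root>")
    print(count_tags(sample))
```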

HTH,

--
Hans Nowak (ha**@zephyrfalcon.org)
http://zephyrfalcon.org/

Jul 18 '05 #2

Alex Mizrahi

(message (Hello 'Hans)
(you :wrote :on '(Sun, 11 Jul 2004 21:32:11 -0400))
(

> i have 3mb long XML document with about 150000 lines (i think it has
> about 200000 elements there) which i want to parse to DOM to work
> with.
> first i thought there will be no problems, but there were..
> first i tried Python.. there's special interest group that wants to
> "make Python become the premier language for XML processing" so i
> thought there will be no problems with this document..
> i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i
> killed it - i don't want such processing.. i think this is because of
> that fat class implementation - possibly Python had some significant
> overhead for each object instance, or something like this..

HN> Have you tried ElementTree?

no..
just tried it - it eats 133 megs and parses for quite a long time, however
it works..
i'll consider using it because processing xml in c++ appears to be pain in
ass..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))

Jul 18 '05 #3

Peter Hansen

Alex Mizrahi wrote:

> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.

Often, problems with performance come down to using the
wrong algorithm, or using the wrong architecture for the
problem at hand.

Are you absolutely certain that using a full in-memory DOM
representation is the best for your problem? It seems
very unlikely to me that it really is...

For example, there are approaches which can read in the
document incrementally (and I'm not just talking SAX here),
rather than read the whole thing at once.
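
One such incremental approach can be sketched with `xml.etree.ElementTree.iterparse`, a later standard-library API that is not named in this post. Each element is handled as soon as its end tag is seen and then cleared, so the memory in use stays close to the size of one record rather than the whole document. The `rec` and `val` tag names and the `sum_values` helper are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def sum_values(source, record_tag="rec", value_tag="val"):
    """Stream through the document, summing <val> text inside each <rec>."""
    total = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            for val in elem.iter(value_tag):
                total += int(val.text)
            elem.clear()   # drop the record's children once it is processed
    return total


if __name__ == "__main__":
    doc = io.BytesIO(b"<root><rec><val>1</val><val>2</val></rec>"
                     b"<rec><val>40</val></rec></root>")
    print(sum_values(doc))
```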

I found your analysis fairly simplistic, on the whole...

-Peter

Jul 18 '05 #4

Paul Rubin

Peter Hansen <pe***@engcorp.com> writes:

> For example, there are approaches which can read in the
> document incrementally (and I'm not just talking SAX here),
> rather than read the whole thing at once.

Rather than either reading incrementally or else slurping in the
entire document in many-noded glory, I wonder if anyone's implemented
a parser that scans over the XML doc and makes a compact sequential
representation of the tree structure, and then provides access methods
that let you traverse the tree as if it were a real DOM, by fetching
the appropriate strings from the (probably mmap'ed) disk file as you
walk around in the tree.
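
A toy version of this idea, sketched under several assumptions: the `OffsetIndex` class and its layout are invented for the example, the file is assumed to be UTF-8, and only tag names are fetched lazily. One pass with expat records, for each element, the byte offset of its start tag and the index of its parent in two flat arrays; the tag name itself is read from the mmap'ed file only when asked for.

```python
import mmap
import os
import re
import tempfile
import xml.parsers.expat
from array import array

_NAME = re.compile(rb"[^\s/>]+")   # bytes of a tag name after the '<'
_CHUNK = 1 << 16


class OffsetIndex:
    """Flat index of an XML file: start offset and parent for each element."""

    def __init__(self, path):
        self.starts = array("q")    # byte offset of each element's '<'
        self.parents = array("q")   # index of the parent element, -1 for root
        self._file = open(path, "rb")
        self.data = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        stack = [-1]
        parser = xml.parsers.expat.ParserCreate()

        def start(name, attrs):
            self.starts.append(parser.CurrentByteIndex)
            self.parents.append(stack[-1])
            stack.append(len(self.starts) - 1)

        def end(name):
            stack.pop()

        parser.StartElementHandler = start
        parser.EndElementHandler = end
        for pos in range(0, len(self.data), _CHUNK):
            parser.Parse(self.data[pos:pos + _CHUNK], False)
        parser.Parse(b"", True)

    def tag(self, i):
        """Fetch element i's tag name from the mapped file on demand."""
        return _NAME.match(self.data, self.starts[i] + 1).group().decode("utf-8")

    def children(self, i):
        """Linear scan for simplicity; a real index would store child links."""
        return [j for j, p in enumerate(self.parents) if p == i]

    def close(self):
        self.data.close()
        self._file.close()


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.write(fd, b'<a><b x="1">hi</b><c/></a>')
    os.close(fd)
    idx = OffsetIndex(path)
    print([idx.tag(i) for i in idx.children(0)])
    idx.close()
    os.unlink(path)
```

The index costs two integers per element regardless of how long the names, attributes or text are, which is the point of the approach; text and attribute access would need further offsets recorded in the same way.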

Jul 18 '05 #5

John Lenton

On Mon, 12 Jul 2004 03:19:03 +0300, Alex Mizrahi <ud******@hotmail.com> wrote:

> Hello, All!
>
> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were..
> first i tried Python.. there's special interest group that wants to "make
> Python become the premier language for XML processing" so i thought there
> will be no problems with this document..
> i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
> it - i don't want such processing.. i think this is because of that fat
> class implementation - possibly Python had some significant overhead for each
> object instance, or something like this..

in my experience xml.dom.minidom is a hog; there are several other DOM
and tree-building (i.e., DOMish) parsers out there, most of them
better from a performance point of view; my own personal favourite is
libxml2, but google for the issue and you'll even come across people
who have compared the different things.

--
John Lenton (jl*****@gmail.com) -- Random fortune:
bash: fortune: command not found

Jul 18 '05 #6

Tim Bradshaw

Paul Rubin <http://ph****@NOSPAM.invalid> wrote in message news:<7x************@ruckus.brouhaha.com>...

> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.

I dunno if this has been done recently, but this is the sort of thing
that people used to do for very large SGML documents. I forget the
details, but I remember things that were some hundreds of MB (parsed
or unparsed I'm not sure) which would be written out in some parsed
form into large files, which could then be manipulated as if the whole
object was there. Of course no one would care about a few hundred MB
of memory now, but they did then (this was 91-92 I think).

I had a theory of doing this all lazily, so you wouldn't have to do
the (slow) parsing step up-front but would just lie and say `OK, I
parsed it', then actually doing the work only on demand.
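
The coarsest form of that trick can be sketched in a few lines. This is an assumption-laden toy, not the SGML tooling described above: the `LazyDocument` class is hypothetical, it defers the whole parse rather than individual subtrees, and it leans on the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


class LazyDocument:
    """Claims to be parsed immediately; does the real parse on first access."""

    def __init__(self, xml_text):
        self._xml_text = xml_text
        self._root = None
        self.parse_count = 0       # only here to make the laziness observable

    @property
    def root(self):
        if self._root is None:
            self._root = ET.fromstring(self._xml_text)
            self.parse_count += 1
        return self._root


if __name__ == "__main__":
    doc = LazyDocument("<a><b/></a>")   # returns at once, nothing parsed yet
    print(doc.parse_count)
    print(doc.root.tag, doc.parse_count)
```

A finer-grained version would apply the same pattern per subtree, parsing a child only when something walks into it.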

--tim

Jul 18 '05 #7

Alex Mizrahi

(message (Hello 'Peter)
(you :wrote :on '(Sun, 11 Jul 2004 22:15:50 -0400))
(

> i have 3mb long XML document with about 150000 lines (i think it has
> about 200000 elements there) which i want to parse to DOM to work
> with.

PH> Often, problems with performance come down to using the wrong
PH> algorithm, or using the wrong architecture for the problem at hand.

i see nothing wrong in loading 3 mb data into RAM. however, implementation
details made it 100 times larger and it was the problem..

PH> Are you absolutely certain that using a full in-memory DOM
PH> representation is the best for your problem? It seems very unlikely
PH> to me that it really is...

format i'm dealing with is quite chaotic and i'm going to work with it
interactively - track down myself where the data i need lies and see how i can
extract it..
it's only a small part of the task and it's needed only temporarily, so i don't
need the best thing possible - i need something that just works..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))

Jul 18 '05 #8

Alex Mizrahi

(message (Hello 'Paul)
(you :wrote :on '(11 Jul 2004 20:13:42 -0700))
(

> For example, there are approaches which can read in the document
> incrementally (and I'm not just talking SAX here), rather than read
> the whole thing at once.

PR> Rather than either reading incrementally or else slurping in the
PR> entire document in many-noded glory, I wonder if anyone's
PR> implemented a parser that scans over the XML doc and makes a compact
PR> sequential representation of the tree structure, and then provides
PR> access methods that let you traverse the tree as if it were a real
PR> DOM, by fetching the appropriate strings from the (probably mmap'ed)
PR> disk file as you walk around in the tree.

that would be nice.. i remember i did something like this for one binary
chunky format - thingie avoided allocating new memory as long as possible..

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
(prin1 "Jane dates only Lisp programmers"))

Jul 18 '05 #9

Richie Hindle

[Paul]

> Rather than either reading incrementally or else slurping in the
> entire document in many-noded glory, I wonder if anyone's implemented
> a parser that scans over the XML doc and makes a compact sequential
> representation of the tree structure, and then provides access methods
> that let you traverse the tree as if it were a real DOM, by fetching
> the appropriate strings from the (probably mmap'ed) disk file as you
> walk around in the tree.

It's not exactly what you describe here, but xml.dom.pulldom is roughly
this. You access the higher-level nodes by SAX, thus not using massive
amounts of memory, but you can access the children of those higher-level
elements using DOM. The canonical example is processing a large number of
XML records - the XML document is arbitrarily large but the individual
records aren't. Pulldom passes each record to you SAX-style, and you use
DOM to process the record.

Uche Ogbuji has a short article on xml.dom.pulldom here:
http://www-106.ibm.com/developerwork...tipulldom.html

Here's the example from that article. Line 16 is the key to it - that's the
point at which you switch from SAX to DOM:

 1  #Get the first line in Act IV, scene II
 2
 3  from xml.dom import pulldom
 4
 5  hamlet_file = open("hamlet.xml")
 6
 7  events = pulldom.parse(hamlet_file)
 8  act_counter = 0
 9  for (event, node) in events:
10      if event == pulldom.START_ELEMENT:
11          if node.tagName == "ACT":
12              act_counter += 1
13              scene_counter = 1
14          if node.tagName == "SCENE":
15              if act_counter == 4 and scene_counter == 2:
16                  events.expandNode(node)
17                  #Traditional DOM processing starts here
18                  #Get all descendant elements named "LINE"
19                  line_nodes = node.getElementsByTagName("LINE")
20                  #Print the text data of the text node
21                  #of the first LINE element
22                  print(line_nodes[0].firstChild.data)
23              scene_counter += 1
--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #10
