SAX/Python: read an XML file from the end to the top

kepioo

I currently have an xml input file containing lots of data. My objective
is to write a script that reports in another xml file only the data I
am interested in. Doing this is really easy using SAX.

The input file is continuously updated. However, the other xml file
should be updated only on request.

Every time we run the script, we track the new elements in the input
file and report them in the output file.

My idea was to:
_ detect in the output file the last event reported
_ read the input file from the end
_ report all the new events (since the last time the script was run).

Question: Is it possible to read an XML file and process it from the
end to the beginning, using SAX?

Mar 7 '06 #1


Diez B. Roggisch

kepioo wrote:

> I currently have an xml input file containing lots of data. My objective
> is to write a script that reports in another xml file only the data I
> am interested in. Doing this is really easy using SAX.
>
> The input file is continuously updated. However, the other xml file
> should be updated only on request.
>
> Every time we run the script, we track the new elements in the input
> file and report them in the output file.
>
> My idea was to:
> _ detect in the output file the last event reported
> _ read the input file from the end
> _ report all the new events (since the last time the script was run).
>
> Question: Is it possible to read an XML file and process it from the
> end to the beginning, using SAX?

No. And in no other XML-related technology I know of.

Generally speaking, I'd say your approach is inherently flawed. XML as a language requires well-formed documents to have
exactly one root element. This makes it unsuitable
for e.g. logging-files, as these have no explicit "end" - except the implicit last log-entry. So you will always have
something like this:

--- begin ---
<root>
<entry/>
<entry/>
--- end ---

I don't know _what_ you do, but unless you always rewrite the whole XML file from scratch, you can't possibly write that
closing end-tag. So you end up with a malformed XML document. Or you _do_ rewrite all the file contents each time -
but then you'd be able to reverse the order of the elements so that the last came first. But I doubt the latter, as it
imposes a great performance bottleneck with little gain.

SAX won't puke on you for your file being malformed, as it only learns about that when it is too late. So you might use
it anyway: by the time the error is raised, you are already finished with your actual task.
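
To illustrate the point about SAX noticing the problem only at the very end, here is a minimal sketch using the standard `xml.sax` module. The handler name `TagCollector` and the helper `parse_truncated` are made up for this example; the input is the unterminated document from above.

```python
import xml.sax


class TagCollector(xml.sax.ContentHandler):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_truncated(data):
    """Parse `data` and return (tags seen, parse error or None)."""
    handler = TagCollector()
    error = None
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        # The missing </root> is only detected once the input is exhausted.
        error = exc
    return handler.tags, error


# The document lacks its closing </root> tag.
tags, error = parse_truncated(b"<root><entry/><entry/>")
print(tags)
print("parse error raised:", error is not None)
```

The handler has already received every element by the time the exception is raised, so the work done in the callbacks is not lost.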

But you will always have to parse it from the beginning, to catch the document header, and there is no fast-forward
built into SAX.

So - what are your options?

- use separate output files for each entry, each well-formed in itself. Beware if you've got plenty of them
(few K to M) that some FS might not deal well with that

- if you can keep the file open for reading all the time (because you are kind of a background process), you can read the
contents, create a buffer and search for start-tags in that yourself. Then you can snip out the necessary portions,
complete them with an XML header and feed them to the parser separately.

- if you can't keep it open, you can simulate that using the seek function

Both of the last two options are somewhat cumbersome, as you have to do a lot of parsing yourself, which is exactly what
one chose XML to avoid in the first place... From that follows the last advice:

- ditch XML. Either totally, or at least as format for the whole file. Instead, use some protocol like this:

--- begin ---
Chunk-Length: 100
<?xml version="1.0"?>
<root>... ( a 100 byte size xml document)
</root>
Chunk-Length: 200
<?xml version="1.0"?>
<root>... ( a 200 byte size xml document)
</root>
....
--- end ---

Then you can easily read through your document, skip unnecessary entries and extract the ones you want. Or, when keeping
the file open, know exactly what to read for the next chunk.
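
A rough sketch of how such a chunked file could be written and read in Python. The function names are invented for the example, and it assumes one detail the format above leaves open: a single newline after each document body that is not counted in `Chunk-Length`.

```python
import io

HEADER = b"Chunk-Length: "


def write_chunk(stream, document):
    """Append one XML document, preceded by its length in bytes."""
    stream.write(HEADER + str(len(document)).encode("ascii") + b"\n")
    stream.write(document)
    stream.write(b"\n")  # assumed separator, not counted in the length


def iter_chunks(stream, skip=0):
    """Yield each document in turn, seeking past the first `skip` unread."""
    index = 0
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(HEADER):
            raise ValueError("expected a Chunk-Length header, got %r" % line)
        length = int(line[len(HEADER):].strip())
        if index < skip:
            # Jump over the body and its trailing newline without reading it.
            stream.seek(length + 1, io.SEEK_CUR)
        else:
            yield stream.read(length)
            stream.readline()  # consume the newline after the body
        index += 1


# Demo on an in-memory file; a real file opened in binary mode works the same.
log = io.BytesIO()
write_chunk(log, b'<?xml version="1.0"?><root><entry id="1"/></root>')
write_chunk(log, b'<?xml version="1.0"?><root><entry id="2"/></root>')
log.seek(0)
docs = list(iter_chunks(log, skip=1))
print(docs)
```

Each yielded chunk is a complete document, so it can be handed to `xml.sax.parseString` on its own.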

Diez

Mar 7 '06 #2

kepioo

Hi Diez,

thank you for your answer. Let me give you more background on the
project.

The input xml I am parsing is always well formed. It comes out of
another application that appends to this xml. I haven't seen the source
code of the application, but I know that it is not rewriting the whole
xml. I think it is just removing the closing root tag, adding the new
tags and writing the </root> tag again.

We don't want to create new output files for every entry (each entry
is an event, and we have approximately 5 events per minute). So I
have to stick with this xml input file.

I guess I will parse it until I find the last reported event and update
the output xml from there, reporting only the events I am interested
in. I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events/minute, xml file running for 1 month -->
5,400,000 opening tags).

What do you think?

Mar 7 '06 #3

Diez B. Roggisch

> We don't want to create new output files for every entry (each entry
> is an event, and we have approximately 5 events per minute). So I
> have to stick with this xml input file.

Well, the overall amount of data won't change. But I can understand that
decision. However, you might consider using a file per day/week.

> I guess I will parse it until I find the last reported event and update
> the output xml from there, reporting only the events I am interested
> in. I hope SAX won't take too much time to do all this (let's say 1
> event = 10 tags, 5 events/minute, xml file running for 1 month -->
> 5,400,000 opening tags).

Use my suggested approach 2 - that boils down to using "seek" and some
hand-written parsing/buffering. A little bit nasty, but better than
consuming all of that file through sax.
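
A minimal sketch of that seek-and-buffer idea, under some loudly stated assumptions: entries are `<case>` ... `</case>` blocks with no attributes on `<case>` and no nesting, the file only ever grows, and the script stores the byte offset it reached between runs. The names `read_new_cases` and `MessageCollector` are hypothetical.

```python
import os
import tempfile
import xml.sax

OPEN_TAG = b"<case>"
CLOSE_TAG = b"</case>"
XML_HEADER = b'<?xml version="1.0"?>'


def read_new_cases(path, offset):
    """Return (fragments, new_offset) for complete <case> blocks after `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    fragments = []
    pos = 0
    while True:
        start = data.find(OPEN_TAG, pos)
        if start == -1:
            break
        end = data.find(CLOSE_TAG, start)
        if end == -1:
            break  # entry still being written; pick it up on the next run
        end += len(CLOSE_TAG)
        fragments.append(data[start:end])
        pos = end
    return fragments, offset + pos


class MessageCollector(xml.sax.ContentHandler):
    """Collect the text of every <Message> element."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._in_message = False

    def startElement(self, name, attrs):
        if name == "Message":
            self._in_message = True
            self.messages.append("")

    def characters(self, content):
        if self._in_message:
            self.messages[-1] += content

    def endElement(self, name):
        if name == "Message":
            self._in_message = False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.xml")
    with open(path, "wb") as f:
        f.write(b"<root>\n<case><Message>fruits</Message></case>\n</root>\n")
    first, offset = read_new_cases(path, 0)
    # Simulate the other application adding one more entry.
    with open(path, "wb") as f:
        f.write(b"<root>\n<case><Message>fruits</Message></case>\n"
                b"<case><Message>names</Message></case>\n</root>\n")
    second, offset = read_new_cases(path, offset)

handler = MessageCollector()
for fragment in second:
    # Each fragment is a small well-formed document once it has a header.
    xml.sax.parseString(XML_HEADER + fragment, handler)
print(handler.messages)
```

Only the bytes written since the last run are read, and SAX still does the real parsing of each entry.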

Diez

Mar 7 '06 #4

Peter Hansen

kepioo wrote:

> The input xml I am parsing is always well formed. It comes out of
> another application that appends to this xml. I haven't seen the source
> code of the application, but I know that it is not rewriting the whole
> xml. I think it is just removing the closing root tag, adding the new
> tags and writing the </root> tag again.

If the writers had a clue, they probably just seek to the end of the
file minus len('</root>') (or whatever) and then overwrite with the new
entry and another </root> element. At least, that's what seemed like
the obvious approach when I had to do this once.
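
For what it's worth, that append could look roughly like the sketch below. It is a guess at what such a writer might do, not the actual application's code; `append_entry` is a made-up name and the file is assumed to end with exactly `</root>` plus a newline.

```python
import os
import tempfile

CLOSING = b"</root>\n"  # assumed: the file ends with exactly these bytes


def append_entry(path, entry):
    """Overwrite the trailing </root> with `entry` followed by a fresh </root>."""
    with open(path, "r+b") as f:
        f.seek(-len(CLOSING), os.SEEK_END)
        if f.read() != CLOSING:
            raise ValueError("file does not end with the expected closing tag")
        # Go back to where the closing tag starts and write over it.
        f.seek(-len(CLOSING), os.SEEK_END)
        f.write(entry + b"\n" + CLOSING)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.xml")
    with open(path, "wb") as f:
        f.write(b"<root>\n<entry/>\n" + CLOSING)
    append_entry(path, b'<entry text="blah blah"/>')
    with open(path, "rb") as f:
        result = f.read()
print(result.decode("ascii"))
```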

Not that this is particularly relevant to the problem. ;-)

> I guess I will parse it until I find the last reported event and update
> the output xml from there, reporting only the events I am interested
> in. I hope SAX won't take too much time to do all this (let's say 1
> event = 10 tags, 5 events/minute, xml file running for 1 month -->
> 5,400,000 opening tags).
>
> What do you think?

I think (guessing wildly) you probably have a fairly restricted number
of possibilities being written to this file, possibly as simple as the
somewhat stereotypical '<entry text="blah blah"/>' type of thing which
I've seen lots of times.

If so, you can simply treat this as a text file which you process
manually, in whatever direct and crude fashion works best, such as by
seeking 1000 chars back from the end (assuming new entries are always
less than that length), scanning for the last "<entry" string, and
slicing and dicing till you find the stuff you need.
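
Something along these lines, as a crude sketch. The function name `last_entry` is invented, the window is counted in bytes rather than characters, and the `"<entry"` marker is just the stereotypical case mentioned above.

```python
import os
import tempfile


def last_entry(path, marker=b"<entry", window=1000):
    """Return everything from the last `marker` in the final `window` bytes
    to the end of the file, or None if the marker is not found there."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))  # never seek before the start
        tail = f.read()
    index = tail.rfind(marker)
    if index == -1:
        return None
    return tail[index:]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.xml")
    with open(path, "wb") as f:
        f.write(b'<root>\n<entry text="one"/>\n<entry text="two"/>\n</root>\n')
    found = last_entry(path)
print(found)
```

The result still carries the closing root tag, so the slicing and dicing happens on `found` from there.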

In other words, screw SAX, just grab the data directly and forget about
all those silly well-formed XML issues etc. Go for the simplest thing
that could possibly work, and if you don't need the complexity of SAX,
don't use it.

-Peter

Mar 7 '06 #5

kepioo

Thanks Diez for your suggestion, I'll look around to find out more
about the seek function (I learnt Python 2 weeks ago and I do not have
a programming background, but so far I am doing well).

Peter,

I can't really follow your advice: the entries are not that
stereotypical. We built a data structure for the xml and we report various
types of events, always in the same format but with different
content types.

The script I am writing aims at picking only special events
(identified by a route tag and an information tag).

Anyway, thank you for your advice!

Mar 7 '06 #6

Peter Hansen

kepioo wrote:

> Peter,
>
> I can't really follow your advice: the entries are not that
> stereotypical. We built a data structure for the xml and we report various
> types of events, always in the same format but with different
> content types.
>
> The script I am writing aims at picking only special events
> (identified by a route tag and an information tag).

Can you post one or two small examples that show the range of
possibilities? I still have this feeling there will be a simpler
approach than really parsing the XML, but maybe I'm wrong.

-Peter

Mar 7 '06 #7

kepioo

An example (I changed the content to make it easier):

################### Input file ####################

<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
<Element name="peach">25</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:45:28 SGT 2006" >
<Message>names</Message>
<Elements>
<Element name="CEO">vincent</Element>
<Element name="Analysit">Robert</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:50:28 SGT 2006" >
<Message>open the car</Message>
</TimeStamp>
</case>
<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="peach">25</Element>
<Element name="apple">8</Element>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</root>
##############################################

The script I want to write has to track any change in the input
file (what we want to track is given as parameters in the script; here,
for instance, the number of apples and cherries). The output file for this
example would be (we write it as a stream):

################### Output file #################################
<track>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">8</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</track>
##############################################
The input file keeps being generated. The output file is generated on
request. Both are stream-based: we append to the end of the file.
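
The filtering itself could be sketched with a SAX handler like the one below. It is only an illustration built from the sample above: the names `TrackHandler`, `render` and `TRACKED` are made up, and it assumes every tracked value sits in an `<Element name="...">` under a `<TimeStamp>`, with one output `<case>` per tracked element.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

TRACKED = {"apple", "cherry"}  # assumed script parameter


class TrackHandler(xml.sax.ContentHandler):
    """Collect (date, message, name, value) for every tracked <Element>."""

    def __init__(self, tracked):
        super().__init__()
        self.tracked = tracked
        self.records = []
        self._date = None
        self._message = ""
        self._name = None
        self._text = []

    def startElement(self, name, attrs):
        self._text = []
        if name == "TimeStamp":
            self._date = attrs.get("Date")
            self._message = ""
        elif name == "Element":
            self._name = attrs.get("name")

    def characters(self, content):
        # Text may arrive in several pieces, so buffer it.
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip()
        if name == "Message":
            self._message = text
        elif name == "Element" and self._name in self.tracked:
            self.records.append((self._date, self._message, self._name, text))
        self._text = []


def render(records):
    """Build the <track> document, one <case> per tracked element."""
    lines = ["<track>"]
    for date, message, name, value in records:
        lines += [
            "<case>",
            "<TimeStamp Date=%s >" % quoteattr(date),
            "<Message>%s</Message>" % escape(message),
            "<Elements>",
            "<Element name=%s>%s</Element>" % (quoteattr(name), escape(value)),
            "</Elements>",
            "</TimeStamp>",
            "</case>",
        ]
    lines.append("</track>")
    return "\n".join(lines)


SAMPLE = b"""<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
</Elements>
</TimeStamp>
</case>
<case>
<TimeStamp Date="Mon Feb 20 19:50:28 SGT 2006" >
<Message>open the car</Message>
</TimeStamp>
</case>
</root>"""

handler = TrackHandler(TRACKED)
xml.sax.parseString(SAMPLE, handler)
print(render(handler.records))
```

Records come out in document order, which matches the order of the output file shown above.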

Mar 8 '06 #8

