472,117 Members | 2,170 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,117 software developers and data experts.

xml + mmap cross

Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.
Sep 3 '08 #1
6 1581
castironpi wrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.
Any hints on what you are talking about?

Stefan
Sep 4 '08 #2
On Sep 4, 2:14*am, Stefan Behnel <stefan...@behnel.dewrote:
castironpi wrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.

Any hints on what you are talking about?

Stefan
Nice to hear from you. I assumed you were familiar with the problem;
you're not.

In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

You get constant time updates to contents, and log-time searches.

There was a similar problem today in:

From: Gerhard Häring <g...@ghaering.de>
Date: Thu, 04 Sep 2008 13:08:59 +0200
Subject: Re: cPickle

The OP wanted to update the third element in a pickled tuple, but not
the first two.

I propose to write a tree structure to a memory-mapped file. A
heavyweight string class, Rope, I wrote, exceeded native string speeds
at a file size of two megs. You could use that, or store the tree
directly.

The obstacle is probably mmap 'alloc' and 'free' routines, which I
posted on Google Code.
Sep 4 '08 #3
On Sep 4, 8:31*am, castironpi <castiro...@gmail.comwrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.
XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset: http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI: http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.
Sep 5 '08 #4
On Sep 4, 7:54*pm, alex23 <wuwe...@gmail.comwrote:
On Sep 4, 8:31*am, castironpi <castiro...@gmail.comwrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.

XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset:http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI:http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.
I was thinking something much less restrictive than the two links.
Since it's not text, I'm not sure it event counts as structured
markup. More generic, something like hierarchical 'tag-content-child'
pairs.

Here's what the xml.etree.ElementTree API says:

Each element has a number of properties associated with it:

- a tag which is a string identifying what kind of data this element
represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence

Since all of these would be buffer-based representations, the
attribute list would merely implement the mapping-object protocol, not
be in a true dictionary. The strings would be stored as offsets to
length-prefixed buffer segments.

Each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

These are all integers representing offsets elsewhere into the map.

A short observation:
>>a= e.XML( '<a><b>abc</b></a>' )
a.getchildren()[0].text
'abc'
>>a.getchildren()[0].text= 'ab<'
e.tostring(a)
'<a><b>ab&lt;</b></a>'
>>e.XML(_)
<Element a at c2c3f0>
>>_.getchildren()[0].text
'ab<'

The current implementation supports round trips between special
characters '<' and markup '&lt;', which I propose to support as well.

Of course, you'd have to garbage collect removed nodes by hand, on any
deletions.

Also, poss. change subject to: ElementTree + mmap cross.
Sep 5 '08 #5
Hi,

this discussion seems pretty much off-topic for a Python list.

castironpi wrote:
In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.
Do I understand you right: you want to access serialised XML data as a memory
mapped file and operate directly on the byte data? You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.

How do you plan to store the pointers to a node's next sibling/child? And how
do you keep them updated over insertions/deletions? That, plus the copy
overhead in a sequential file, will be very costly on each change.

You get constant time updates to contents, and log-time searches.
Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?

Stefan
Sep 5 '08 #6
On Sep 5, 12:13*am, Stefan Behnel <stefan...@behnel.dewrote:
Hi,

this discussion seems pretty much off-topic for a Python list.

castironpi wrote:
In an XML file, entries are stored in serial, sort of like this.
AAA BBB CCC DDD
Or more recognizably,
<A><B><C>something</C><D>something</D></B></A>
Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.
AAA BBB CCC DDD
AAA BBBb CCC DDD
requires 7 writes, 'b CCC DDD', not 1.
I want to use a simple tree structure to store:
0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None
Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

Do I understand you right: you want to access serialised XML data
alex23 correctly pointed out last night that XML is always
serialized. What I mean and possibly what you mean is, a serialized
tree structure.
as a memory
mapped file and operate directly on the byte data?
Yes. Byte data containing both strings, and the structure of the
tree.
You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.
No, it's saved in an index structure; it's already in one. To find a/
b/c, read a, read its children to b, read b, read its children to c,
read c.
How do you plan to store the pointers to a node's next sibling/child? Andhow
do you keep them updated over insertions/deletions?
You update them when you insert them. To add 'c' to a/b, allocate c,
initialize c, read a, read its children to b, read b, read its
children to the end, add c.

When you delete c from a/b/c, however, any references to c that you
have in any programs become invalid. Don't delete it if you have
them. The bytes are marked in the file to be no longer in use, which
marking takes up some of the space in the file.
That, plus the copy
overhead in a sequential file, will be very costly on each change.
No. That's the point of a dynamic structure. <abc><defare not
stored in memory as 10 consecutive characters. The file is not
strictly speaking XML. See below for what they're stored as.
You get constant time updates to contents, and log-time searches.

Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?
You don't use a sequential file.
Stefan
I stated earlier that each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

All these fields are integers that contain an offset into the file.

Simplified:

0 Reserved
1 A.Tag 7 (points to 7 below)
2 A.FirstChild 4 (etc.)
3 A.Contents 0 (None)
4 B.Tag 11
5 B.FirstChild 0
6 B.Contents 15
7 3abc
11 3def
15 5ghijk

This translates to:

<abc><def>ghijk</def></abc>

But isn't stored that way.

The records are all the same size. The 'Tag' field of a record that
starts at offset J is at offset J. The 'FirstChild' field of a record
that starts at offset K is at offset K+ 1.

'tag_offset', 'text_offset', 'key_offset', and 'value_offset' contain
offsets of variable-length strings into the file ( 7, 11, 15 ). The
rest contain offsets of further structures.

There's extra space usage not only in the structure of the tree, but
in the record-keeping of what bytes are available for use, and which
are in use. You have to grow the file size yourself (like growing an
array) when you need to; the file system won't for you in mmap. (This
means the alloc-free module I'm looking at will need modifications.)

I'll reemphasize the value of constant-time insertions to a file
though.
Sep 5 '08 #7

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

3 posts views Thread by junky_fellow | last post: by
4 posts views Thread by Hao Xu | last post: by
4 posts views Thread by Fabiano Sidler | last post: by
2 posts views Thread by beejisbrigit | last post: by
1 post views Thread by James T. Dennis | last post: by
1 post views Thread by koara | last post: by
2 posts views Thread by Neal Becker | last post: by
reply views Thread by Kris Kennaway | last post: by
reply views Thread by Gabriel Genellina | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.