473,398 Members | 2,380 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,398 software developers and data experts.

xml + mmap cross

Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.
Sep 3 '08 #1
6 1645
castironpi wrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.
Any hints on what you are talking about?

Stefan
Sep 4 '08 #2
On Sep 4, 2:14*am, Stefan Behnel <stefan...@behnel.dewrote:
castironpi wrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.

Any hints on what you are talking about?

Stefan
Nice to hear from you. I assumed you were familiar with the problem;
you're not.

In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

You get constant time updates to contents, and log-time searches.

There was a similar problem today in:

From: Gerhard Häring <g...@ghaering.de>
Date: Thu, 04 Sep 2008 13:08:59 +0200
Subject: Re: cPickle

The OP wanted to update the third element in a pickled tuple, but not
the first two.

I propose to write a tree structure to a memory-mapped file. A
heavyweight string class, Rope, I wrote, exceeded native string speeds
at a file size of two megs. You could use that, or store the tree
directly.

The obstacle is probably mmap 'alloc' and 'free' routines, which I
posted on Google Code.
Sep 4 '08 #3
On Sep 4, 8:31*am, castironpi <castiro...@gmail.comwrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.
XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset: http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI: http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.
Sep 5 '08 #4
On Sep 4, 7:54*pm, alex23 <wuwe...@gmail.comwrote:
On Sep 4, 8:31*am, castironpi <castiro...@gmail.comwrote:
Any interest in pursuing/developing/working together on a mmaped-xml
class? *Faster, not readable in text editor.

XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset:http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI:http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.
I was thinking something much less restrictive than the two links.
Since it's not text, I'm not sure it event counts as structured
markup. More generic, something like hierarchical 'tag-content-child'
pairs.

Here's what the xml.etree.ElementTree API says:

Each element has a number of properties associated with it:

- a tag which is a string identifying what kind of data this element
represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence

Since all of these would be buffer-based representations, the
attribute list would merely implement the mapping-object protocol, not
be in a true dictionary. The strings would be stored as offsets to
length-prefixed buffer segments.

Each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

These are all integers representing offsets elsewhere into the map.

A short observation:
>>a= e.XML( '<a><b>abc</b></a>' )
a.getchildren()[0].text
'abc'
>>a.getchildren()[0].text= 'ab<'
e.tostring(a)
'<a><b>ab&lt;</b></a>'
>>e.XML(_)
<Element a at c2c3f0>
>>_.getchildren()[0].text
'ab<'

The current implementation supports round trips between special
characters '<' and markup '&lt;', which I propose to support as well.

Of course, you'd have to garbage collect removed nodes by hand, on any
deletions.

Also, poss. change subject to: ElementTree + mmap cross.
Sep 5 '08 #5
Hi,

this discussion seems pretty much off-topic for a Python list.

castironpi wrote:
In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.
Do I understand you right: you want to access serialised XML data as a memory
mapped file and operate directly on the byte data? You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.

How do you plan to store the pointers to a node's next sibling/child? And how
do you keep them updated over insertions/deletions? That, plus the copy
overhead in a sequential file, will be very costly on each change.

You get constant time updates to contents, and log-time searches.
Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?

Stefan
Sep 5 '08 #6
On Sep 5, 12:13*am, Stefan Behnel <stefan...@behnel.dewrote:
Hi,

this discussion seems pretty much off-topic for a Python list.

castironpi wrote:
In an XML file, entries are stored in serial, sort of like this.
AAA BBB CCC DDD
Or more recognizably,
<A><B><C>something</C><D>something</D></B></A>
Point is, to change <C>something</Cto <C>something else</C>, you
have to recopy everything after that.
AAA BBB CCC DDD
AAA BBBb CCC DDD
requires 7 writes, 'b CCC DDD', not 1.
I want to use a simple tree structure to store:
0 A-None, 1
1 B-None, 2
2 C-3, None
3 D-None, None
Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

Do I understand you right: you want to access serialised XML data
alex23 correctly pointed out last night that XML is always
serialized. What I mean and possibly what you mean is, a serialized
tree structure.
as a memory
mapped file and operate directly on the byte data?
Yes. Byte data containing both strings, and the structure of the
tree.
You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.
No, it's saved in an index structure; it's already in one. To find a/
b/c, read a, read its children to b, read b, read its children to c,
read c.
How do you plan to store the pointers to a node's next sibling/child? Andhow
do you keep them updated over insertions/deletions?
You update them when you insert them. To add 'c' to a/b, allocate c,
initialize c, read a, read its children to b, read b, read its
children to the end, add c.

When you delete c from a/b/c, however, any references to c that you
have in any programs become invalid. Don't delete it if you have
them. The bytes are marked in the file to be no longer in use, which
marking takes up some of the space in the file.
That, plus the copy
overhead in a sequential file, will be very costly on each change.
No. That's the point of a dynamic structure. <abc><defare not
stored in memory as 10 consecutive characters. The file is not
strictly speaking XML. See below for what they're stored as.
You get constant time updates to contents, and log-time searches.

Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?
You don't use a sequential file.
Stefan
I stated earlier that each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

All these fields are integers that contain an offset into the file.

Simplified:

0 Reserved
1 A.Tag 7 (points to 7 below)
2 A.FirstChild 4 (etc.)
3 A.Contents 0 (None)
4 B.Tag 11
5 B.FirstChild 0
6 B.Contents 15
7 3abc
11 3def
15 5ghijk

This translates to:

<abc><def>ghijk</def></abc>

But isn't stored that way.

The records are all the same size. The 'Tag' field of a record that
starts at offset J is at offset J. The 'FirstChild' field of a record
that starts at offset K is at offset K+ 1.

'tag_offset', 'text_offset', 'key_offset', and 'value_offset' contain
offsets of variable-length strings into the file ( 7, 11, 15 ). The
rest contain offsets of further structures.

There's extra space usage not only in the structure of the tree, but
in the record-keeping of what bytes are available for use, and which
are in use. You have to grow the file size yourself (like growing an
array) when you need to; the file system won't for you in mmap. (This
means the alloc-free module I'm looking at will need modifications.)

I'll reemphasize the value of constant-time insertions to a file
though.
Sep 5 '08 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: junky_fellow | last post by:
Consider a scenario, where one process uses mmap() to access/modify a particaular file and at the same time other process uses read/write to access/modify the same file. I wanted to know whether...
4
by: Hao Xu | last post by:
Hi everyone! I found that if you want to write to the memory got by mmap(), you have to get the file descriptor for mmap() in O_RDWR mode. If you got the file descriptor in O_WRONLY mode, then...
4
by: Fabiano Sidler | last post by:
Hi folks! I created an mmap object like so: --- snip --- from mmap import mmap,MAP_ANONYMOUS,MAP_PRIVATE fl = file('/dev/zero','rw') mm = mmap(fl.fileno(), 1, MAP_PRIVATE|MAP_ANONYMOUS) ---...
2
by: beejisbrigit | last post by:
Hi there, I was wondering if anyone had experience with File I/O in Java vs. C++ using mmap(), and knew if the performance was better in one that the other, or more or less negligible. My...
1
by: James T. Dennis | last post by:
I've been thinking about the Python mmap module quite a bit during the last couple of days. Sadly most of it has just been thinking ... and reading pages from Google searches ... and very little...
1
by: koara | last post by:
Hello all, i am using the mmap module (python2.4) to access contents of a file. My question regards the relative performance of mmap.seek() vs mmap.tell(). I have a generator that returns...
2
by: Neal Becker | last post by:
On linux, I don't understand why: f = open ('/dev/eos', 'rw') m = mmap.mmap(f.fileno(), 1000000, prot=mmap.PROT_READ|mmap.PROT_WRITE, flags=mmap.MAP_SHARED) gives 'permission denied', but...
0
by: Kris Kennaway | last post by:
If I do the following: def mmap_search(f, string): fh = file(f) mm = mmap.mmap(fh.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ) return mm.find(string) def mmap_is_in(f, string): fh =...
0
by: Gabriel Genellina | last post by:
En Thu, 29 May 2008 19:17:05 -0300, Kris Kennaway <kris@FreeBSD.org> escribió: Looks like you should define the sq_contains member in mmap_as_sequence, and the type should have the...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.