Re: dynamic allocation file buffer

On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:

I will try my idea again. I want to talk to people about a module I
want to write and I will take the time to explain it. I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from. What are its flaws?

[snip long description with not-very-credible use-cases]

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file, and how much crack did they smoke?

Castironpi, what do *you* use this proof-of-concept module for? Don't
bother telling us what you think *we* should use it for. Tell us what you're
using it for, or at least what somebody else is using it for. If this is
just a module that you think will be cool, I don't like your chances of
people caring. There is no shortage of "cool" software that isn't useful
for anything, and unlike eye-candy, nobody is going to use your module
just because they like the algorithm.

If you don't have an existing application for the software, then explain
what it does (not how) and give some idea of the performance ("it's alpha
and written in Python and really slow, but I will re-write it in C and
expect it to make a billion random accesses in a 10GB file per
millisecond", or whatever). You might be lucky and have somebody say
"Hey, that's just the tool I need to solve my problem!".
--
Steven

Sep 9 '08 #1

castironpi

On Sep 9, 5:58 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:
I will try my idea again. I want to talk to people about a module I
want to write and I will take the time to explain it. I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from. What are its flaws?

[snip long description with not-very-credible use-cases]

Steven,

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file, and how much crack did they smoke?

I judge from the existence of 'shelve' and 'pickle' modules, and
relational database packages, that the problem I am addressing is not
rare. It could be the millionaire investor across the street, the
venture capitalist down the hall, or the guy with a huge CD catalog.

Castironpi, what do *you* use this proof-of-concept module for?

Honestly, nothing yet. I just wrote it. My user community and
customer base are very small. Originally, I wanted to store variable-
length strings in a file, where shelves and databases were overkill.
I created it for its beauty, sorry to disappoint.
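
For readers who want something concrete, here is a minimal sketch of storing variable-length strings in a file by prefixing each one with its length. It is not the module discussed in this thread; the 4-byte header layout and the names `write_string` and `read_string` are assumptions made up for illustration.

```python
import io
import struct

# Hypothetical layout: each string is stored as a 4-byte little-endian
# length followed by its UTF-8 bytes.
HEADER = struct.Struct("<I")

def write_string(f, text):
    """Append one string and return the offset at which it starts."""
    data = text.encode("utf-8")
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(HEADER.pack(len(data)))
    f.write(data)
    return offset

def read_string(f, offset):
    """Read the string whose length header starts at `offset`."""
    f.seek(offset)
    (length,) = HEADER.unpack(f.read(HEADER.size))
    return f.read(length).decode("utf-8")

buf = io.BytesIO()  # stands in for a real file opened in "r+b" mode
first = write_string(buf, "short")
second = write_string(buf, "a considerably longer string")
```

The returned offsets act as handles: `read_string(buf, second)` fetches the second string without reading the first. What this sketch cannot do is shrink, grow, or reclaim a string in place, which is the part an alloc/free layer would have to add.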

Don't
bother telling us what you think *we* should use it for. Tell us what you're
using it for, or at least what somebody else is using it for. If this is
just a module that you think will be cool, I don't like your chances of
people caring. There is no shortage of "cool" software that isn't useful
for anything, and unlike eye-candy, nobody is going to use your module
just because they like the algorithm.

Unfortunately, nobody is going to care about most of the uses I have
for it 'til I have a job. I'm goofing around with a laptop,
remembering when my databases professor kept dropping the ball on
VARCHARs. If you want a sound bite, think, "imagine programming
without 'new' and 'malloc'."

If you don't have an existing application for the software, then explain
what it does (not how) and give some idea of the performance ("it's alpha
and written in Python and really slow, but I will re-write it in C and
expect it to make a billion random accesses in a 10GB file per
millisecond", or whatever). You might be lucky and have somebody say
"Hey, that's just the tool I need to solve my problem!".

I wrote a Rope implementation just to test drive it. It exceeded the
native immutable string type at 2 megs. It used 'struct' instead of
'ctypes', so that number could conceivably come down. I am intending
to leave it in pure Python, so there.

--
Steven

Pleasure chatting as always sir.

Sep 10 '08 #2

Fredrik Lundh

Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.

</F>

Sep 10 '08 #3

Steven D'Aprano

On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:

Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who
has a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.
--
Steven

Sep 10 '08 #4

Aaron "Castironpi" Brady

On Sep 10, 5:24 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who
has a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.

--
Steven

Steven,

Are you claiming that sequential storage is sufficient for small
amounts of data, and relational db.s are necessary for large amounts?
It's possible that there is only the fringe exception, in which case
'alloc/free' aren't useful in the majority of cases, and will never
win customers away from the more mature competition.

Regardless, it is an elegant solution to the problem of storing
variable-length strings, with hardly any practical value. Perfect for
grad school.

Sep 10 '08 #5

Steven D'Aprano

On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:

On Sep 10, 5:24 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects
a very small number of people, at least judging by your use-cases.
Who has a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that
can render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And
given the right tools, doing that is no harder than doing the same to
a 4GB text file.

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've
understood what Castironpi is trying for, his primary use case was for
people wanting exactly that.

--
Steven

Steven,

Are you claiming that sequential storage is sufficient for small amounts
of data, and relational db.s are necessary for large amounts?

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.

--
Steven

Sep 11 '08 #6

Fredrik Lundh

Steven D'Aprano wrote:

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

An XML file doesn't contain any indexing information, so random access
to a large XML file is very inefficient. You can build (or precompute)
index information and store in a separate file, of course, but that's
hardly something that's useful in the general case.
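
To illustrate the separate-index idea, here is a rough sketch that scans a record-oriented XML file once, remembers the byte offset of each record, and then seeks straight to a chosen record. It assumes every record sits on its own line and starts with `<record`; that layout, the sample data, and the names `build_index` and `read_record` are hypothetical and not taken from any real tool.

```python
import io
import xml.etree.ElementTree as ET

def build_index(f, tag=b"<record"):
    """Scan once and return the byte offset of every line starting with `tag`."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith(tag):
            offsets.append(pos)
    return offsets

def read_record(f, offsets, n):
    """Seek to record n and parse only that one line as an XML fragment."""
    f.seek(offsets[n])
    return ET.fromstring(f.readline())

# Sample data; assumes one complete record per line.
data = io.BytesIO(
    b"<log>\n"
    b"<record id='0'>alpha</record>\n"
    b"<record id='1'>beta</record>\n"
    b"<record id='2'>gamma</record>\n"
    b"</log>\n"
)
index = build_index(data)
```

Here `read_record(data, index, 2)` parses only the third record. The index still costs one full sequential pass to build and goes stale as soon as the file is edited, which matches the point that this is not much use in the general case.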

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;
data that's intended to be consumed by sequential processes (and you can
do a lot with sequential processing these days; for those interested in
this, digging up a few review papers on "data stream processing" might
be a good way to waste some time).
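
As a small sketch of that sequential style, the standard library's `xml.etree.ElementTree.iterparse` can walk a record-oriented file one element at a time. The `<record>` tag, the `level` attribute, and the function name `count_errors` are assumptions chosen for the example.

```python
import io
import xml.etree.ElementTree as ET

def count_errors(source):
    """Stream <record> elements and count those marked as errors."""
    errors = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            if elem.get("level") == "error":
                errors += 1
            elem.clear()  # drop the element's content once it has been handled
    return errors

# Tiny stand-in for a multi-gigabyte log file.
sample = io.BytesIO(
    b"<log>"
    b"<record level='info'>started</record>"
    b"<record level='error'>disk full</record>"
    b"<record level='error'>retry failed</record>"
    b"</log>"
)
```

Calling `count_errors(sample)` consumes the input front to back. Note that `clear()` empties each record but the root element still keeps a reference to the emptied child, so a very long stream may also need the processed children removed from the root.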

Document-style XML usually fits into memory on modern machines;
structures larger than that are usually split into different parts (e.g.
using XInclude) and stored in a container file.

Random *modifications* to an arbitrary XML file cannot be done, as long
as you store the file in a standard file system. And if you invent your
own format, it's no longer an XML file.

</F>

Sep 11 '08 #7

Paul Boddie

On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;

I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.

Paul

Sep 11 '08 #8

Aaron "Castironpi" Brady

On Sep 11, 2:40 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Wed, 10 Sep 2008 11:59:35 -0700, Aaron "Castironpi" Brady wrote:
On Sep 10, 5:24 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
Steven D'Aprano wrote:

You've created a solution to a problem which (probably) only affects
a very small number of people, at least judging by your use-cases.
Who has a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that
can render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And
given the right tools, doing that is no harder than doing the same to
a 4GB text file.

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've
understood what Castironpi is trying for, his primary use case was for
people wanting exactly that.

--
Steven

Steven,

Are you claiming that sequential storage is sufficient for small amounts
of data, and relational db.s are necessary for large amounts?

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.

--
Steven

XML is the wrong word for the example I was thinking of (as was
already pointed out in another thread). XML is by definition
sequential. The use case pertained to a generic element hierarchy;
think of 4GB of hierarchical data.

Sep 11 '08 #9

Aaron "Castironpi" Brady

On Sep 11, 5:35 am, Paul Boddie <p...@boddie.org.uk> wrote:

On 11 Sep, 10:34, Fredrik Lundh <fred...@pythonware.com> wrote:

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;

I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.

Paul

An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.

Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC. And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.
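
A rough sketch of what "changing one word at the right place" could look like with the standard `mmap` and `struct` modules follows. The record layout (a 4-byte counter plus a 16-byte name) and the idea of using a plain byte offset as the "reference" are assumptions for illustration, not the design of the module under discussion.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<I16s")  # hypothetical record: 4-byte counter + 16-byte name

# Create a small file holding three fixed-size records.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "r+b") as f:
    for name in (b"alice", b"bob", b"carol"):
        f.write(RECORD.pack(0, name))
    f.flush()
    with mmap.mmap(f.fileno(), 0) as m:
        ref = 1 * RECORD.size               # the "reference" is just a byte offset
        struct.pack_into("<I", m, ref, 42)  # overwrite the counter word of record 1
        m.flush()
    f.seek(0)
    records = [RECORD.unpack(f.read(RECORD.size)) for _ in range(3)]
os.remove(path)
```

Only four bytes of the file change; nothing is parsed, rebuilt, or rewritten. The price is that nothing checks the offset either, so a stale or wrong reference silently corrupts a neighbouring record.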

Sep 11 '08 #10

Paul Boddie

On 11 Sep, 19:31, "Aaron \"Castironpi\" Brady" <castiro...@gmail.com>
wrote:

An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I presume that you mean efficient access to large amounts of data in
the sense that if all the data you want happens to be in the same page
or segment, then retrieving it is much more efficient than having to
seek around for all the different pieces. So the defragmentation would
be what they call clustering in a relational database context:

http://www.postgresql.org/docs/8.3/s...l-cluster.html

I've seen similar phenomena outside the relational database world,
notably with big Lucene indexes which wouldn't fit in memory in their
entirety.

I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.

Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC.

Of course, databases are widely used in scalable systems to hold
central state, which is why there's a lot of effort put into not
only scaling up database installations, but also into things like
caching which are supposed to save the database systems behind popular
Web applications from excessive load.

And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.

I think that the challenge is to reduce an abstract operation (for
example, wanting to update a particular column in a particular record)
to its measurable effects (this word of memory/disk will change as a
consequence). It's easy for a human with a reasonable knowledge of,
say, a relational database system to anticipate such things, but to
actually collapse a number of layers through some kind of generic
optimisation process is a lot more difficult.

Paul

Sep 11 '08 #11

Steven D'Aprano

On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:

XML is the wrong word for the example I was thinking of (as was already
pointed out in another thread). XML is by definition sequential.

I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:

"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."

http://en.wikipedia.org/wiki/Xml

--
Steven

Sep 12 '08 #12

Aaron "Castironpi" Brady

On Sep 11, 10:37 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Thu, 11 Sep 2008 10:20:41 -0700, Aaron "Castironpi" Brady wrote:
XML is the wrong word for the example I was thinking of (as was already
pointed out in another thread). XML is by definition sequential.

I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:

"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."

http://en.wikipedia.org/wiki/Xml

--
Steven

That's my choice of words at work again, I'm afraid. What I mean is,
there is no possibility that you can correctly interpret a segment of
XML text without knowing certain facts about everything that precedes
it. Compare to the case of a fixed-length record file, of record size
say 20, where you know the meaning of the characters in offset ranges
20-40, 80-100, 500020-500040, etc.
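
The fixed-length case can be sketched in a few lines: with a record size of 20, record n always occupies bytes 20*n to 20*n+19, so the ranges 20-40, 80-100 and 500020-500040 are records 1, 4 and 25001. The sample data and the name `read_record` are made up for the example.

```python
import io

RECORD_SIZE = 20

def read_record(f, n):
    """Jump straight to record n; nothing before it has to be read."""
    f.seek(n * RECORD_SIZE)
    return f.read(RECORD_SIZE)

# One hundred space-padded 20-byte records standing in for a real file.
data = io.BytesIO(
    b"".join(f"record {i:05d}".ljust(RECORD_SIZE).encode("ascii") for i in range(100))
)
```

`read_record(data, 4)` returns the bytes at offsets 80 to 99 without touching records 0 to 3, which is exactly what a serialised XML document cannot offer.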

To clarify the point of the use case in question, because data would
be allocated and located dynamically, it's possible that you could read
the first several words, then not need anything until say, the 1KB
mark. (Unless you're somehow storing an offset into an XML string as
a value in the string, which would require composing it, leaving room
for that value, and then writing it with random access anyway.) There
can be gaps in a dynamically managed buffer--- say the unused/free
bytes from offsets 200 to 220, but every byte that follows another in
an XML file follows it in the file's meaning too. Is this any
clearer?
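
A toy version of such a dynamically managed buffer might look like the sketch below: a first-fit allocator over a `bytearray` that hands out offsets and can leave a free gap at offsets 200 to 219. The class name `Arena`, the first-fit policy, and the in-memory bookkeeping are all assumptions for illustration; the actual module is not shown in this thread and may work quite differently.

```python
class Arena:
    """Toy first-fit allocator over a fixed-size bytearray (illustrative only)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.free_list = [(0, size)]  # sorted list of (offset, length) gaps
        self.sizes = {}               # offset -> length of each live allocation

    def alloc(self, length):
        """Return the offset of the first gap that can hold `length` bytes."""
        for i, (off, avail) in enumerate(self.free_list):
            if avail >= length:
                if avail == length:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + length, avail - length)
                self.sizes[off] = length
                return off
        raise MemoryError("no gap large enough")

    def free(self, off):
        """Give a block back; adjacent gaps are not merged in this sketch."""
        length = self.sizes.pop(off)
        self.free_list.append((off, length))
        self.free_list.sort()

arena = Arena(1024)
a = arena.alloc(200)  # occupies bytes 0..199
b = arena.alloc(20)   # occupies bytes 200..219
c = arena.alloc(100)  # occupies bytes 220..319
arena.free(b)         # leaves a gap at offsets 200..219
d = arena.alloc(16)   # first fit reuses the start of that gap
```

After these calls the bytes at 216 to 219 are still unused, yet the block at offset 220 remains valid and readable on its own. A persistent version would also have to keep the free list inside the file itself, which this sketch does not attempt.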

Aaron

Sep 12 '08 #13

Steven D'Aprano

On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:

On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
<st****@REMOVE.THIS.cybersource.com.au> declaimed the following in
comp.lang.python:

I'm pretty sure you're wrong. XML can be used for serialization, but
that doesn't mean it is only sequential data. XML is suitable for
hierarchical data too. To quote Wikipedia:

There is a difference between the format of the data content, and
the processing of that data... Regardless of the content, one
essentially has to process the XML /file/ sequentially, and translate
into an in-memory model that allows for accessing said data. To reach
the nth subelement of the mth element requires reading all 1..m-1
elements, followed by all 1..n-1 subelements in m. Modifying any element
requires rewriting the entire file.

Which is why I previously said that XML was not well suited for random
access.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

--
Steven

Sep 12 '08 #14

Aaron "Castironpi" Brady

On Sep 12, 1:30 am, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:
On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> declaimed the following in
comp.lang.python:

I'm pretty sure you're wrong. XML can be used for serialization, but
that doesn't mean it is only sequential data. XML is suitable for
hierarchical data too. To quote Wikipedia:

There is a difference between the format of the data content, and
the processing of that data... Regardless of the content, one
essentially has to process the XML /file/ sequentially, and translate
into an in-memory model that allows for accessing said data. To reach
the nth subelement of the mth element requires reading all 1..m-1
elements, followed by all 1..n-1 subelements in m. Modifying any element
requires rewriting the entire file.

Which is why I previously said that XML was not well suited for random
access.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

--
Steven

By 'isn't workable' do you mean, "no one ever uses 4GB of XML", or "no
one ever uses 4GB of hierarchical data, period"?

Sep 12 '08 #15

Paul Boddie

On 12 Sep, 08:30, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

Which is why I previously said that XML was not well suited for random
access.

Maybe not. A consideration of other storage formats such as HDF5 might
be appropriate:

http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

There are, of course, HDF5 tools available for Python.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication.

I don't know about that. I'm managing to keep up with the discussion.

We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

Again, XML specifically might not be workable for random access in a
serialised form, despite people's best efforts at processing it in
various unconventional ways, but that doesn't mean that random access
to a 4GB file containing hierarchical data isn't possible, so I
suppose it depends on whether he is wedded to the idea of using
vanilla XML or not. It's always worth exploring the available
alternatives before embarking on a challenging project, unless one
wants to pursue the exercise as a learning experience, and I therefore
suggest investigating whether HDF5 doesn't already solve at least some
of the problems or use-cases stated in this discussion.

Paul

Sep 12 '08 #16

Aaron "Castironpi" Brady

On Sep 12, 4:34 am, Paul Boddie <p...@boddie.org.uk> wrote:

On 12 Sep, 08:30, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

Which is why I previously said that XML was not well suited for random
access.

Maybe not.

No, it's not. Element trees are, which is what I should have said
originally...

A consideration of other storage formats such as HDF5 might
be appropriate:

http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

There are, of course, HDF5 tools available for Python.

PyTables came up within the past few weeks on the list.

"When the file is created, the metadata in the object tree is updated
in memory while the actual data is saved to disk. When you close the
file the object tree is no longer available. However, when you reopen
this file the object tree will be reconstructed in memory from the
metadata on disk...."

This is different from what I had in mind, but the extremity depends
on how slow the 'reconstructed in memory' step is. (From
http://www.pytables.org/docs/manual/ch01.html#id2506782 ). The
counterexample would be needing random access into multiple data
files, which don't all fit in memory at once, but the maturity of the
package might outweigh that. Reconstruction will form a bottleneck
anyway.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication.

I don't know about that. I'm managing to keep up with the discussion.

We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

I could renege that bid and talk about a 4MB file, where recopying is
prohibitively expensive and so random access is needed, thereby
requiring an alternative to XML.

Again, XML specifically might not be workable for random access in a
serialised form, despite people's best efforts at processing it in
various unconventional ways, but that doesn't mean that random access
to a 4GB file containing hierarchical data isn't possible, so I
suppose it depends on whether he is wedded to the idea of using
vanilla XML or not.

No. It is always nice to be able to scroll through your data, but
it's much less common to be able to scroll through a data -structure-.
(Which is part of the reason data structures are hard to design.)

It's always worth exploring the available
alternatives before embarking on a challenging project, unless one
wants to pursue the exercise as a learning experience, and I therefore
suggest investigating whether HDF5 doesn't already solve at least some
of the problems or use-cases stated in this discussion.

The potential for concurrency is definitely one benefit of raw alloc/
free management, and a requirement I was setting out to program
directly for. There is a multi-threaded version of HDF5 but
interprocess communication is unsupported.

"This version serializes the API suitable for use in a multi-threaded
application but does not provide any level of concurrency."

From: http://www.hdfgroup.uiuc.edu/papers/features/mthdf/

(It is always appreciated to find a statement of what a product does
not do.)

Paul

There is an updated statement of the problem on the project website:

http://code.google.com/p/pymmapstruc...mmapstruct.txt

I don't have numbers for my claim that the abstraction layers in SQL,
including string construction and parsing, are ever a bottleneck or
limiting factor, despite that it's sort of intuitive. Until I get
those, maybe I should leave that allegation out.

Compared to the complexity of all these other packages (ZOPE,
memcached, HDF5/PyTables), alloc and free are almost looking like they
should become methods on a subclass of the builtin buffer type. Ha!
(Ducks.) They're beyond dangerous compared to the snuggly feeling of
Python though, so maybe they could belong in ctypes.

Aaron

Sep 12 '08 #17
