ZODB memory problems (was: processing a Very Large file)

[posted to comp.lang.python, mailed to zo******@zope.org]


I'm having problems storing a large number of objects in a ZODB.
After committing changes to the database, elements are not cleared from
memory. Since the number of objects I'd like to store in the ZODB is too
large to fit in RAM, my program gets killed with signal 11 or signal 9...

Below is a minimal working (or actually: it doesn't work, because of memory
problems) example, with hopefully enough comments:

# This was suggested by Tim Peters in comp.lang.python thread
# 'processing a Very Large file'
# It is to make sure that no two or more copies of the same object
# reside in memory
class ObjectInterning:
    def __init__(self):
        self.object_table = {}

    def object_intern(self, o):
        return self.object_table.setdefault(o, o)
from sets import Set

# An ExtendedTuple is a tuple with some extra information
# (hence: 'Extended'). Furthermore, the elements of the tuple are
# unique.
# As you can see, ExtendedTuple does not inherit from Persistent.
# It will not be stored in the root of a database directly, it will
# be stored in a Persistent ExtendedTupleTable (see below).
class ExtendedTuple(tuple):

    def __init__(self, els):

        # This is a set containing other ExtendedTuple objects
        # which conflict with self
        # e.g. if self = ExtendedTuple([1,2,3,4]) and
        # other = ExtendedTuple([3,4,5]) then self conflicts with
        # other, because they share one or more elements (in this
        # case: 3 and 4)
        # So, self.conflicts = Set([ExtendedTuple([3,4,5])])
        #     other.conflicts = Set([ExtendedTuple([1,2,3,4])])
        self.conflicts = Set()

    def __hash__(self):
        return hash(tuple(self))

    def __repr__(self):
        return 'ExtendedTuple(%s)' % str(list(self))
import ZODB
from persistent import Persistent
import random

# The Persistent ExtendedTupleTable generates and stores a large
# amount of ExtendedTuple objects. Since ExtendedTuple contains a
# Set with other ExtendedTuple objects, each ExtendedTuple object
# may get very large.
class ExtendedTupleTable(Persistent):
    def __init__(self):
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set() # et(s): ExtendedTuple object(s)
        # This dictionary stores a mapping of elements to Sets of
        # ExtendedTuples.
        # eg: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #     self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {} # el: element of an ExtendedTuple object

        # These dictionaries are here for performance optimizations.
        # They are used to prevent billions of hash()
        # calculations (relatively slow compared to dictionary
        # lookups)
        self._v_el2hs = {} # h(s): hash(es) of ExtendedTuple object(s)
        self._v_h2et = {}
        self._v_et2h = {}

        # The keys of el2ets (and thus the elements of the
        # ExtendedTuple objects) are all in a prespecified range.
        # In this example: range(200):
        self.__el_count = 200
        # Number of ExtendedTuple objects in this ExtendedTupleTable
        self.__et_count = 5000

    # Start generation of ExtendedTuple objects and calculation of
    # conflicts for each ExtendedTuple object
    def calculate_all(self):

    def add(self, et_uninterned):
        et = self.interning.object_intern(et_uninterned)
        h = self.interning.object_intern(hash(et))
        self._v_h2et[h] = et
        self._v_et2h[et] = h
        self._p_changed = True

    def calc_ets(self):
        # Calculate a large amount of ExtendedTuple objects.
        # In this example, the tuples are random, the elements of
        # the tuples are within a prespecified range.
        # The elements of each tuple are unique.
        print 'generating %s random ExtendedTuple objects' % self.__et_count
        for i in xrange(self.__et_count):
            # Create random tuple with unique elements
            l = []
            for el in xrange(self.__el_count/3):
            et = ExtendedTuple(tuple(Set(l)))
        self.__et_count = len(self.ets)

    def calc_el2ets(self):
        '''For each el, calculate which et uses that el'''
        for el in xrange(self.__el_count):
            print 'calculating all ExtendedTuple objects using', el
            self.el2ets[el] = Set([ et for et in self.ets if el in et])
            self._v_el2hs[el] = Set([ self._v_et2h[et] for et in
                                      self.el2ets[el] ])
            self._p_changed = True

    def calc_conflicts(self):
        '''For each et, calculate the set of conflicting ets'''
        self.__et_count = len(self.ets)
        commit_interval = 100
        for i, et in enumerate(self.ets):
            print 'calculating conflicting ExtendedTuple %.2f%%' %
            # use the el2ets dictionary (faster than 'Cartesian'
            # comparison of all ExtendedTuple objects) and an
            # optimization dictionary _v_el2hs to prevent billions
            # of hash() calculations later on
            conflicts_h = [ h for el in et for h in self._v_el2hs[el] ]
            # Make sure each element is unique and store the
            # result as conflicts Set in the current ExtendedTuple
            # object
            conflicts_unique = Set(conflicts_h) - Set([ self._v_et2h[et] ])
            et.conflicts = Set([ self._v_h2et[h] for h in
                                 conflicts_unique ])
            self._p_changed = True
            if i % commit_interval == 0:
                print 'committing data to database...'
                # This does NOT seem to work, the memory usage will
                # increase until all memory + swap is used. Then the
                # process gets killed...

from ZODB import FileStorage, DB

# Open Database
storage = FileStorage.FileStorage('/tmp/test_extendedtuples.fs')
db = DB(storage)
conn = db.open()
root = conn.root()

name = 'test'

if root.has_key(name):
    del root[name]

root[name] = ExtendedTupleTable()

rt = root[name]

# if needed, commit final changes and close the database
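
To make the conflict definition from the comments above concrete, here is a small in-memory sketch with no ZODB involved. It is not the code from the post: it assumes Python 3, uses the built-in set instead of sets.Set and plain tuples instead of ExtendedTuple, and the helper names intern_object and conflicts_by_element are made up for illustration.

```python
from collections import defaultdict


def intern_object(table, obj):
    """Return the canonical copy of obj, registering it on first sight."""
    return table.setdefault(obj, obj)


def conflicts_by_element(tuples):
    """Map each tuple to the set of other tuples sharing an element with it."""
    # Inverted index: element -> set of tuples that contain that element.
    el2ets = defaultdict(set)
    for et in tuples:
        for el in et:
            el2ets[el].add(et)

    conflicts = {}
    for et in tuples:
        found = set()
        for el in et:
            found |= el2ets[el]
        found.discard(et)  # a tuple does not conflict with itself
        conflicts[et] = found
    return conflicts


table = {}
a = intern_object(table, (1, 2, 3, 4))
b = intern_object(table, (3, 4, 5))
c = intern_object(table, (7, 8))
# An equal tuple built separately is mapped back to the first copy.
a_again = intern_object(table, tuple([1, 2, 3, 4]))

result = conflicts_by_element([a, b, c])
```

Here result[a] contains only (3, 4, 5), result[b] contains only (1, 2, 3, 4), and result[c] is empty. The inverted index avoids comparing every pair of tuples, which is the same idea as el2ets in the post.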

What should I do to make sure RAM is no longer a limiting factor?
(In other words: the program should work with any (large) value of
self.__el_count and self.__et_count, because in my case
self.__et_count = 5000 is only a toy example...)
I'm now working on a PC with 2.5 GB RAM and even that's not enough!

If you think the design of this program is bad, please let me know what you
would do to solve this problem (calculating and saving tuples + set of
conflicts). The only thing I can't change is that ExtendedTuple inherits
from tuple.

Thanks in advance,

Jul 19 '05 #1
class ExtendedTupleTable(Persistent):
    def __init__(self):
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set() # et(s): ExtendedTuple object(s)
        # This dictionary stores a mapping of elements to Sets of
        # ExtendedTuples.
        # eg: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #     self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {} # el: element of an ExtendedTuple object


Note: I might be wrong. I say this here instead of qualifying every
assertion below. Thank you.

If you want more fine-grained swapping-out to disk, you might want to
look at the classes provided by the BTrees modules that come with ZODB.
Built-in container classes like set and dictionary are effectively
opaque to ZODB - they have to be loaded into memory or written out to
disk as one whole unit, container and contents. This is true for the
Persistent versions of the containers as well - these are special mostly
because they automatically detect when they are modified.

In order to have some contents of a container pickled out to disk and
others available in memory, you should use BTrees:
root = get_zodb_root_container()
from BTrees import IOBTree
root['el2ets'] = el2ets = IOBTree.IOBTree()
el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])

IOBTree means that it's designed to have integer keys and arbitrary
object values. OOBTree means you can use arbitrary objects (e.g.
tuples) as keys. I read that you should avoid using instances of
subclasses of Persistent as keys in BTrees unless you are very careful
implementing __cmp__(); instead confine your keys to objects
constructed from immutable python types, e.g., strings, tuples, tuples
of strings, ...

If you break down the persistent chunks into small enough pieces and
use the transaction commit and abort appropriately (that takes some
experimenting - e.g., on a read-only loop through every element of a
large BTree, I was running out of memory until I called
transaction.abort() every loop), you should max out your memory usage
at some reasonable amount (determined by cache size) no matter how big
your BTree grows.

Jul 19 '05 #2
