ZODB memory problems (was: processing a Very Large file)

[posted to comp.lang.python, mailed to zo******@zope.org]

Hi,

I'm having problems storing a large number of objects in a ZODB.
After committing changes to the database, elements are not cleared from
memory. Since the number of objects I'd like to store in the ZODB is too
large to fit in RAM, my program gets killed with signal 11 or signal 9...

Below is a minimal example (it does not actually work, because of the
memory errors), with hopefully enough comments:

# This was suggested by Tim Peters in comp.lang.python thread
# 'processing a Very Large file'
# It is to make sure that no two or more copies of the same object
# reside in memory
class ObjectInterning:
    def __init__(self):
        self.object_table = {}

    def object_intern(self, o):
        return self.object_table.setdefault(o, o)


from sets import Set

# An ExtendedTuple is a tuple with some extra information
# (hence: 'Extended'). Furthermore, the elements of the tuple are
# unique.
# As you can see, ExtendedTuple does not inherit from Persistent.
# It will not be stored in the root of a database directly; it will
# be stored in a Persistent ExtendedTupleTable (see below).
class ExtendedTuple(tuple):

    def __init__(self, els):
        tuple.__init__(self, els)

        # This is a set containing the other ExtendedTuple objects
        # that conflict with self,
        # e.g. if self = ExtendedTuple([1,2,3,4]) and
        # other = ExtendedTuple([3,4,5]) then self conflicts with
        # other, because they share one or more elements (in this
        # case: 3 and 4).
        # So, self.conflicts = Set([ExtendedTuple([3,4,5])])
        #     other.conflicts = Set([ExtendedTuple([1,2,3,4])])
        self.conflicts = Set()

    def __hash__(self):
        return hash(tuple(self))

    def __repr__(self):
        return 'ExtendedTuple(%s)' % str(list(self))


import ZODB
from persistent import Persistent
import random

# The Persistent ExtendedTupleTable generates and stores a large
# number of ExtendedTuple objects. Since ExtendedTuple contains a
# Set with other ExtendedTuple objects, each ExtendedTuple object
# may get very large.
class ExtendedTupleTable(Persistent):
    def __init__(self):
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set()  # et(s): ExtendedTuple object(s)
        # This dictionary stores a mapping of elements to Sets of
        # ExtendedTuples.
        # eg: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #     self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {}  # el: element of an ExtendedTuple object

        # These dictionaries are here for performance optimization.
        # They are used to prevent billions of hash()
        # calculations (relatively slow compared to dictionary
        # lookups).
        self._v_el2hs = {}  # h(s): hash(es) of ExtendedTuple object(s)
        self._v_h2et = {}
        self._v_et2h = {}

        # The keys of el2ets (and thus the elements of the
        # ExtendedTuple objects) are all in a prespecified range.
        # In this example: range(200):
        self.__el_count = 200
        # Number of ExtendedTuple objects in this ExtendedTupleTable
        self.__et_count = 5000

    # Start generation of ExtendedTuple objects and calculation of
    # conflicts for each ExtendedTuple object
    def calculate_all(self):
        self.calc_ets()
        self.calc_el2ets()
        self.calc_conflicts()

    def add(self, et_uninterned):
        et = self.interning.object_intern(et_uninterned)
        h = self.interning.object_intern(hash(et))
        self.ets.add(et)
        self._v_h2et[h] = et
        self._v_et2h[et] = h
        self._p_changed = True

    def calc_ets(self):
        # Calculate a large number of ExtendedTuple objects.
        # In this example, the tuples are random, the elements of
        # the tuples are within a prespecified range.
        # The elements of each tuple are unique.
        print 'generating %s random ExtendedTuple objects' % self.__et_count
        for i in xrange(self.__et_count):
            # Create random tuple with unique elements
            l = []
            for el in xrange(self.__el_count/3):
                l.append(random.randint(0, self.__el_count-1))
            et = ExtendedTuple(tuple(Set(l)))
            self.add(et)
        self.__et_count = len(self.ets)

    def calc_el2ets(self):
        '''For each el, calculate which et uses that el'''
        for el in xrange(self.__el_count):
            print 'calculating all ExtendedTuple objects using', el
            self.el2ets[el] = Set([ et for et in self.ets if el in et ])
            self._v_el2hs[el] = Set([ self._v_et2h[et] for et in
                                      self.el2ets[el] ])
        self._p_changed = True

    def calc_conflicts(self):
        '''For each et, calculate the set of conflicting ets'''
        self.__et_count = len(self.ets)
        commit_interval = 100
        for i, et in enumerate(self.ets):
            print 'calculating conflicting ExtendedTuple %.2f%%' % \
                ((i+1)*100./self.__et_count)
            # use the el2et dictionary (faster than 'Cartesian'
            # comparison of each ExtendedTuple objects) and an
            # optimization dictionary _v_el2hs to prevent billions
            # of hash() calculations later on
            conflicts_h = [ h for el in et for h in self._v_el2hs[el] ]
            # Make sure each element is unique, drop the hash of et
            # itself (conflicts_h holds hashes, not ExtendedTuples),
            # and store the result as the conflicts Set in the current
            # ExtendedTuple object
            conflicts_unique = Set(conflicts_h) - Set([ self._v_et2h[et] ])
            et.conflicts = Set([ self._v_h2et[h] for h in
                                 conflicts_unique ])
            self._p_changed = True
            if i % commit_interval == 0:
                print 'committing data to database...'
                # This does NOT seem to work, the memory usage will
                # increase until all memory + swap is used. Then the
                # process gets killed...
                get_transaction().commit(True)
        get_transaction().commit(True)


from ZODB import FileStorage, DB

# Open Database
storage = FileStorage.FileStorage('/tmp/test_extendedtuples.fs')
db = DB(storage)
conn = db.open()
root = conn.root()

name = 'test'

if root.has_key(name):
    del root[name]

root[name] = ExtendedTupleTable()

rt = root[name]
rt.calculate_all()

# if needed, commit final changes and close the database
get_transaction().commit()
conn.close()
db.pack()
db.close()
storage.close()

What should I do to make sure RAM is no longer a limiting factor?
In other words: the program should work with any (large) value of
self.__el_count and self.__et_count, because in my case
self.__et_count = 5000 is only a toy example.
I'm now working on a PC with 2.5 GB RAM and even that's not enough!

If you think the design of this program is bad, please let me know what you
would do to solve this problem (calculating and saving tuples + set of
conflicts). The only thing I can't change is that ExtendedTuple inherits
from tuple.

Thanks in advance,
Stan.


Jul 19 '05 #1
class ExtendedTupleTable(Persistent):
    def __init__(self):
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set() # et(s): ExtendedTuple object(s)
        # This dictionary stores a mapping of elements to Sets of
        # ExtendedTuples.
        # eg: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #     self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {} # el: element of an ExtendedTuple object

#######

Note: I might be wrong. I say this here instead of qualifying every
assertion below. Thank you.

If you want more fine-grained swapping-out to disk, you might want to
look at the classes provided by the BTrees modules that come with ZODB.
Built-in container classes like set and dictionary are effectively
opaque to ZODB: they have to be loaded into memory or written out to
disk as one whole unit, container and contents. This is true for the
Persistent versions of the containers as well; these are special mostly
because they automatically detect when they are modified.
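
To make the "one whole unit" point concrete, here is a minimal sketch. It is
not ZODB code: it uses the standard pickle module directly, on the assumption
(which I believe holds) that ZODB writes each persistent object as one pickle
record and that plain dicts and sets held by that object end up inside the
same record. The helper name record_size is made up for this example, and it
is written for Python 3 with the built-in set instead of sets.Set.

```python
import pickle


def record_size(el2ets):
    """Bytes in the single pickle holding the dict and everything in it."""
    return len(pickle.dumps(el2ets, protocol=2))


# A plain dict of sets, shaped like el2ets in the original post.
el2ets = {el: set() for el in range(200)}
empty_size = record_size(el2ets)

# Fill every set; the one record grows with the total contents.
for el in el2ets:
    el2ets[el] = {(el, i) for i in range(100)}
full_size = record_size(el2ets)

# Touching a single entry still means rewriting a record that
# contains all 200 sets, not just the one that changed.
el2ets[3].add((3, 100))
after_one_change = record_size(el2ets)

print(empty_size, full_size, after_one_change)
```

With a container that is split into many small records, only the piece that
changed would need to be written or kept in memory.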

In order to have some contents of a container pickled out to disk and
others available in memory, you should use BTrees:
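import transaction
from sets import Set
from BTrees import IOBTree

# get_zodb_root_container() is a placeholder for however you obtain
# the root object of your open connection (e.g. conn.root())
root = get_zodb_root_container()
root['el2ets'] = el2ets = IOBTree.IOBTree()
transaction.commit()
el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
transaction.commit()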


IOBTree means that it's designed to have integer keys and arbitrary
object values. OOBTree means you can use arbitrary objects (e.g.
tuples) as keys. I read that you should avoid using instances of
subclasses of Persistent as keys in BTrees unless you are very careful
implementing __cmp__(); instead confine your keys to objects
constructed from immutable Python types, e.g., strings, tuples, tuples
of strings, ...
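
A rough illustration of why the key ordering matters, as far as I understand
it: a BTree keeps its keys sorted and finds them by comparison, so the order
must be the same every time the data is loaded. The sketch below does not use
BTrees at all; a sorted list plus the standard bisect module stands in for one
bucket of keys, and the helper name contains is made up. It is written for
Python 3.

```python
import bisect

# A sorted list stands in for one bucket of BTree keys.
keys = []
for key in [(3, 4, 5), (1, 2, 3), (1, 3, 9)]:
    bisect.insort(keys, key)  # insert while keeping the list sorted


def contains(sorted_keys, key):
    """Binary search; only correct if the ordering of keys is stable."""
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


print(keys)
print(contains(keys, (1, 3, 9)), contains(keys, (2, 4, 9)))
```

Tuples of integers or strings always compare the same way, so the lookup
keeps working after a restart. Objects whose comparison falls back to
something like a memory address would sort differently in the next process,
and the search could then miss keys that are actually stored.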

If you break down the persistent chunks into small enough pieces and
use the transaction commit and abort appropriately (that takes some
experimenting - e.g., on a read-only loop through every element of a
large BTree, I was running out of memory until I called
transaction.abort() every loop), you should max out your memory usage
at some reasonable amount (determined by cache size) no matter how big
your BTree grows.

Jul 19 '05 #2
