
ZODB for inverted index?

Hello,

While trying to write an inverted index (see:
http://en.wikipedia.org/wiki/Inverted_index), I ran out of memory with
a plain dict (I have thousands of documents and millions of terms;
stemming and other filtering are not considered yet, since I wanted to
understand how to handle GBs of text first). I found ZODB and tried it a bit,
but I think I must be misunderstanding how to use it, even after reading
http://www.zope.org/Wikis/ZODB/guide/node3.html...

I would like to use it once to build my inverted index and save it to disk
via a FileStorage, and then reuse this previously created inverted index
from the same FileStorage. But it looks like I am unable to
reread/reload it into memory, or I am missing how to do it...

First, each time I run the code below, it looks like everything is added
a second time. Is there a way to rewrite/replace it instead? And how am I
supposed to use it after the initial creation? I thought that opening the
same FileStorage would reload my object inside dbroot, but it doesn't.
I am also interested in the cache mechanisms: are they transparent?

Or maybe you know a good tutorial for understanding ZODB?

Thanks for any help, regards.

Here is some sample code:

import sys
import transaction
from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent

class IDF2:
    def __init__(self):
        self.docs = OIBTree()   # doc -> number of words seen in doc
        self.idfs = OOBTree()   # term -> OIBTree(doc -> occurrences)
    def add(self, term, fromDoc):
        self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
        if not self.idfs.has_key(term):
            self.idfs[term] = OIBTree()
        self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
    def N(self, term):
        "total number of occurrences of 'term'"
        return sum(self.idfs[term].values())
    def n(self, term):
        "number of documents containing 'term'"
        return len(self.idfs[term])
    def ndocs(self):
        "number of documents"
        return len(self.docs)
    def __getitem__(self, key):
        return self.idfs[key]
    def iterdocs(self):
        for doc in self.docs.iterkeys():
            yield doc
    def iterterms(self):
        for term in self.idfs.iterkeys():
            yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
if not dbroot.has_key('idfs'):
    dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

for i, line in enumerate(open(sys.argv[1])):
    # considering doc is linenumber...
    for word in line.split():
        idfs.add(word, i)
# Commit the change
transaction.commit()

---
I was expecting:

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print dbroot.has_key('idfs')

to return True.

Oct 23 '06 #1
5 Replies


You may want to take a quick look at ZCatalogs. They are
for indexing ZODB objects. I may not be understanding
what you are trying to do. I suspect that you really need
to store everything in a database (MySQL/Postgres/etc) for
maximal flexibility.

-Larry
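A minimal sketch of the database route, assuming the standard-library sqlite3 module as a stand-in for MySQL/Postgres. The table name, column names and helper functions are made up for illustration. Each (term, doc) pair is one row, so the index lives on disk and a rerun replaces the rows instead of adding them again.

```python
import sqlite3
from collections import Counter


def open_index(path):
    """Open (or create) the postings table; all names are illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " term TEXT NOT NULL, doc INTEGER NOT NULL, freq INTEGER NOT NULL,"
        " PRIMARY KEY (term, doc))"
    )
    return conn


def index_lines(conn, lines):
    """Treat each line as one document, as in the original post."""
    conn.execute("DELETE FROM postings")  # rebuild instead of appending
    for doc, line in enumerate(lines):
        counts = Counter(line.split())
        conn.executemany(
            "INSERT INTO postings (term, doc, freq) VALUES (?, ?, ?)",
            [(term, doc, freq) for term, freq in counts.items()],
        )
    conn.commit()


def doc_count(conn, term):
    """Number of documents containing 'term'."""
    row = conn.execute(
        "SELECT COUNT(*) FROM postings WHERE term = ?", (term,)
    ).fetchone()
    return row[0]


def total_count(conn, term):
    """Total number of occurrences of 'term'."""
    row = conn.execute(
        "SELECT COALESCE(SUM(freq), 0) FROM postings WHERE term = ?", (term,)
    ).fetchone()
    return row[0]
```

Passing a file path to open_index() keeps the index between runs; ":memory:" is handy for trying it out.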
Oct 23 '06 #2


Thanks for your reply.

Anyway, can someone help me with how to "rewrite" and "reload" a class
instance when using ZODB?

regards

Oct 25 '06 #3

At Wednesday 25/10/2006 03:54, vd*****@yahoo.fr wrote:
> anyway can someone help me on how to "rewrite" and "reload" a class
> instance when using ZODB ?

What do you mean?
--
Gabriel Genellina
Softlab SRL

Oct 25 '06 #4

vd*****@yahoo.fr wrote:
> Hello,

Hi. I'm not familiar with ZODB, but you might consider berkeleydb,
which behaves like a disk-backed dictionary with an in-memory cache.

-Mike
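To illustrate the disk-backed dictionary idea without installing anything, here is a small sketch that assumes the standard-library shelve module (a pickling layer over dbm) in place of berkeleydb. The add_posting() helper and the sample lines are invented for the example.

```python
import os
import shelve
import tempfile


def add_posting(index, term, doc):
    """Count one occurrence of 'term' in document 'doc'."""
    postings = index.get(term, {})
    postings[doc] = postings.get(doc, 0) + 1
    index[term] = postings  # reassign so the shelf writes the change to disk


path = os.path.join(tempfile.mkdtemp(), "index")

# Build the index once; each line is treated as one document.
with shelve.open(path) as index:
    for doc, line in enumerate(["to be or not to be", "to do"]):
        for word in line.split():
            add_posting(index, word, doc)

# Reopen it later: the data comes back from disk, not from memory.
with shelve.open(path) as index:
    to_postings = index["to"]
```

Every update re-pickles the whole postings dict for that term, so this is only reasonable while individual postings lists stay short.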

Oct 25 '06 #5

vd*****@yahoo.fr wrote:
> [full quote of post #1 and its sample code snipped]

You have to use Persistent as the base class:

class IDF2(Persistent):
    ....

and maybe (?) reassign idfs.idfs = idfs.idfs or set idfs._p_changed = 1 or so; I don't remember the latter exactly.

But I doubt that ZODB's memory management is intelligent enough (even with some extra control?) to really improve your task in terms of memory usage (swapping blackout).
Other ideas:

* Use the filesystem directly for your index, with (escaped) filenames/subdirs as keys, and just append your pointers to the files. This is often the best way to balance memory and disk in extreme index applications. The OS cache is already a good, careful memory/disk balancer, and you can add some extra cache logic in your application. This works best with filesystems that deal well with small files
(but maybe many of your words have long index lists anyway...).
(To reduce the number of files/inodes, you could bulk many items into one pickle/shelve/anydbm file by using sub hash keys. Example: 1 million words = 10000 files x ~100 sub-entries x 10000 refs.)

* A fast relational/dictionary database (MySQL).

* Advanced memory-mapped file techniques / C-OODBMS (ObjectStore/PSE); a 64-bit OS if you go beyond 3GB
(that's the technique telecoms often use to run their tables fast, but this is maybe too advanced...).
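The first idea in the list above could be sketched roughly as follows. This is only an illustration: the hex-encoded filenames, the crc32-based subdirectory and the helper names are assumptions, not something from the thread. Each term gets its own file, and document ids are appended to it one per line.

```python
import os
import tempfile
import zlib


def postings_path(root, term):
    """Map a term to a file; hex encoding makes any term a safe filename."""
    raw = term.encode("utf-8")
    bucket = "%02x" % (zlib.crc32(raw) & 0xFF)  # spread over 256 subdirs
    return os.path.join(root, bucket, raw.hex() + ".txt")


def add_posting(root, term, doc):
    """Append one document id to the term's postings file."""
    path = postings_path(root, term)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write("%d\n" % doc)


def postings(root, term):
    """Read back every document id recorded for 'term'."""
    try:
        with open(postings_path(root, term)) as f:
            return [int(line) for line in f]
    except FileNotFoundError:
        return []


root = tempfile.mkdtemp()
for doc, line in enumerate(["to be or not to be", "to do"]):
    for word in line.split():
        add_posting(root, word, doc)
```

Opening a file for every word is slow; a real build would probably buffer postings in memory and flush them in batches, leaving the rest to the OS cache as described above.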
-robert
Oct 26 '06 #6
