473,320 Members | 1,910 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

ZODB for inverted index?

Hello,

While playing to write an inverted index (see:
http://en.wikipedia.org/wiki/Inverted_index), i run out of memory with
a classic dict, (i have thousand of documents and millions of terms,
stemming or other filtering are not considered, i wanted to understand
how to handle GB of text first). I found ZODB and try to use it a bit,
but i think i must be misunderstanding how to use it even after reading
http://www.zope.org/Wikis/ZODB/guide/node3.html...

i would like to use it once to build my inverted index, save it to disk
via a FileStorage,

and then reuse this previously created inverted index from the
previously created FileStorage, but it looks like i am unable to
reread/reload it in memory, or i am missing how to do it...

firstly each time i use the code below, it looks everything is added
another time, is there a way to rather rewrite/replace it? and how am i
suppose to use it after an initial creation? i thought that using the
same FileStorage would reload my object inside dbroot, but it doesn't.
i was also interested by the cache mecanisms, are they transparent?

or maybe do you know a good tutorial to understand ZODB?

thx for any help, regards.

here is a sample code :

import sys
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent

class IDF2:
def __init__(self):
self.docs = OIBTree()
self.idfs = OOBTree()
def add(self, term, fromDoc):
self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
if not self.idfs.has_key(term):
self.idfs[term] = OIBTree()
self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
def N(self, term):
"total number of occurrences of 'term'"
return sum(self.idfs[term].values())
def n(self, term):
"number of documents containing 'term'"
return len(self.idfs[term])
def ndocs(self):
"number of documents"
return len(self.docs)
def __getitem__(self, key):
return self.idfs[key]
def iterdocs(self):
for doc in self.docs.iterkeys():
yield doc
def iterterms(self):
for term in self.idfs.iterkeys():
yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
if not dbroot.has_key('idfs'):
dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

import transaction
for i, line in enumerate(open(sys.argv[1])):
# considering doc is linenumber...
for word in line.split():
idfs.add(word, i)
# Commit the change
transaction.commit()

---
i was expecting :

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print dbroot.has_key('idfs')

=to return True

Oct 23 '06 #1
5 2625
You may want to take a quick look at ZCatalogs. They are
for indexing ZODB objects. I may not be understanding
what you are trying to do. I suspect that you really need
to store everything in a database (MySQL/Postgres/etc) for
maximal flexibility.

-Larry
Oct 23 '06 #2

thanks for your reply,

anyway can someone help me on how to "rewrite" and "reload" a class
instance when using ZODB ?

regards

Oct 25 '06 #3
At Wednesday 25/10/2006 03:54, vd*****@yahoo.fr wrote:
>anyway can someone help me on how to "rewrite" and "reload" a class
instance when using ZODB ?
What do you mean?
--
Gabriel Genellina
Softlab SRL

__________________________________________________
Correo Yahoo!
Espacio para todos tus mensajes, antivirus y antispam ¡gratis!
¡Abrí tu cuenta ya! - http://correo.yahoo.com.ar
Oct 25 '06 #4
vd*****@yahoo.fr wrote:
Hello,
Hi. I'm not familiar with ZODB, but you might consider berkeleydb,
which behaves like a disk-backed + memcache dictionary.

-Mike

Oct 25 '06 #5
vd*****@yahoo.fr wrote:
Hello,

While playing to write an inverted index (see:
http://en.wikipedia.org/wiki/Inverted_index), i run out of memory with
a classic dict, (i have thousand of documents and millions of terms,
stemming or other filtering are not considered, i wanted to understand
how to handle GB of text first). I found ZODB and try to use it a bit,
but i think i must be misunderstanding how to use it even after reading
http://www.zope.org/Wikis/ZODB/guide/node3.html...

i would like to use it once to build my inverted index, save it to disk
via a FileStorage,

and then reuse this previously created inverted index from the
previously created FileStorage, but it looks like i am unable to
reread/reload it in memory, or i am missing how to do it...

firstly each time i use the code below, it looks everything is added
another time, is there a way to rather rewrite/replace it? and how am i
suppose to use it after an initial creation? i thought that using the
same FileStorage would reload my object inside dbroot, but it doesn't.
i was also interested by the cache mecanisms, are they transparent?

or maybe do you know a good tutorial to understand ZODB?

thx for any help, regards.

here is a sample code :

import sys
from BTrees.OOBTree import OOBTree
from BTrees.OIBTree import OIBTree
from persistent import Persistent

class IDF2:
def __init__(self):
self.docs = OIBTree()
self.idfs = OOBTree()
def add(self, term, fromDoc):
self.docs[fromDoc] = self.docs.get(fromDoc, 0) + 1
if not self.idfs.has_key(term):
self.idfs[term] = OIBTree()
self.idfs[term][fromDoc] = self.idfs[term].get(fromDoc, 0) + 1
def N(self, term):
"total number of occurrences of 'term'"
return sum(self.idfs[term].values())
def n(self, term):
"number of documents containing 'term'"
return len(self.idfs[term])
def ndocs(self):
"number of documents"
return len(self.docs)
def __getitem__(self, key):
return self.idfs[key]
def iterdocs(self):
for doc in self.docs.iterkeys():
yield doc
def iterterms(self):
for term in self.idfs.iterkeys():
yield term

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()#
if not dbroot.has_key('idfs'):
dbroot['idfs'] = IDF2()
idfs = dbroot['idfs']

import transaction
for i, line in enumerate(open(sys.argv[1])):
# considering doc is linenumber...
for word in line.split():
idfs.add(word, i)
# Commit the change
transaction.commit()

---
i was expecting :

storage = FileStorage.FileStorage("%s.fs" % sys.argv[1])
db = DB(storage)
conn = db.open()
dbroot = conn.root()
print dbroot.has_key('idfs')

=to return True

you have to have Persistent as base class

class IDF2(Persistent):
....

and maybe (?) reset idfs.idfs=idfs.idfs or do a idfs._p_changed=1 thing or so - don't remember the latter exactly.

but doubt if the memory management of ZODB is intelligent enough (with some extra control?) really improve your task in terms of mem usage (swapping blackout).
Other ideas:

* This is often the best method to balance mem & disk in extreme index applications: use directly the filesystem (thus (escaped) filenames/subdirs) for your index. You just append your pointers to the files. The OS cache system is already a good careful mem/disc balancer - you can do some extra cache logic in your application. This works best with filesystems who can deal well with small files
(but maybe many of your words have long index lists anyway...)
( To maybe reduce number of files/inodes bulk many items into one pickle/shleve/anddbm.. file by using sub hash keys. Example: 1 million words =10000 files x ~100 sub-entries x 10000 refs. )

* a fast relational/dictionary database (mysql)

* Advanced memory mapped file techniques / C-OODBMS ( ObjectStore/PSE ); 64bit OS if 3GB
( thats the technique telecoms often run their tables fast - but this is maybe too advanced ... )
-robert
Oct 26 '06 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Richard Shea | last post by:
Hi - I'm trying to use ZODB but I'm having trouble getting started. Can anyone point me at a very simple example ? I've had a look at 'ZODB/ZEO Programming Guide'...
49
by: Paul Rubin | last post by:
I've started a few threads before on object persistence in medium to high end server apps. This one is about low end apps, for example, a simple cgi on a personal web site that might get a dozen...
6
by: Almad | last post by:
Hello, I'm going to write a custom CMS. I'd like to use some odbms, as code is then much more cleaner...however, i'm a little bit scared about capabilities of ZoDB, when compared with f. e....
10
by: ls | last post by:
Hi All, I looking for help with ZODB data export to text file. I have file Data.fs (file becomes from Plone CMS) and I have to display content structure, data like intro, body of article, etc...
0
by: Almad | last post by:
Hello, sorry for bothering with same question again. However, month ago, I have tried to handle more zodb databases with zeo as described here:...
1
by: Harald Armin Massa | last post by:
Hello, I am using ZODB "standalone" in version 3.3.1 within some application. Now I learn that the 3.3.x branch of ZODB is "retired". No problem so far, everything is running fine. BUT......
3
by: Rene Pijlman | last post by:
I have a productional Linux web server with a Python/Zope/Plone. Now I'd like to install a non-Zope Python/ZODB application on the same server. What is the recommended way of doing that? Option...
0
by: suryanector | last post by:
anybody knows source code for programs using Index file, inverted file operations, usage of B and B++ trees in C++ language plz send them.
2
by: Petra Chong | last post by:
Hello all I am using Python 2.3 and ZODB (without the rest of Zope) with the following pattern: * One process which writes stuff to a ZODB instance (call it test.db) * Another process which...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: ryjfgjl | last post by:
ExcelToDatabase: batch import excel into database automatically...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
1
by: PapaRatzi | last post by:
Hello, I am teaching myself MS Access forms design and Visual Basic. I've created a table to capture a list of Top 30 singles and forms to capture new entries. The final step is a form (unbound)...
0
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.