
Databases and python


I've been putting a little bit of time into a file indexing engine in
python, which you can find here:
http://dcs.nac.uci.edu/~strombrg/pyindex.html

It'll do 40,000 mail messages of varying lengths pretty well now, but I
want more :)

So far, I've been taking the approach of using a single-table database
like gdbm or dbhash (actually a small number of them, to map filenames to
numbers, numbers to filenames, words to numbers, numbers to words,
and numbered words to numbered filenames), and making each entry keyed by
a word, and under the word in the database is a null-terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).

However, despite using the http://dcs.nac.uci.edu/~strombrg/cachedb.html
module to make database access faster, bringing in psyco, and studying
various python optimization pages, the program just isn't performing like
I'd like it to.

And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.
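
It boils down to something like this (a simplified sketch; the real code
goes through the cachedb and base255 modules, but the shape is the same):

import gdbm

def add_file_to_word(db, word, file_id_str):
    # Read the whole packed list back, append one entry, rewrite it all.
    # That's O(length of the list) per insertion, so a popular word's
    # list costs roughly O(n**2) to build up.
    if db.has_key(word):
        entries = db[word].split('\0')
    else:
        entries = []
    entries.append(file_id_str)
    db[word] = '\0'.join(entries)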

So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.

-But-, are there any database interfaces for python that aren't going to
get a bit upset if you try to give them hundreds of thousands of tables?

Thanks!

Feb 16 '06 #1
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this
database. That's because you're going to lookup by any one of these
columns for the corresponding value.

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them. Let's say
you are scanning your files. You won't need an index on the
filename-to-ID table. That's because you are just putting data in
there. The word-to-ID table needs an index on the word, but not ID
(you're not looking up by ID yet.) And the word ID-to-filename ID table
doesn't need any indexes yet either. So build up the data without the
indexes. Once your scan is complete, then build up the indexes you'll
need for regular operation. You can probably incrementally add data as
you go.

As far as filename ID and word IDs go, just use a counter to generate
the next number. If you use base255 as the number, you're really not
going to save much space.

And your idea of hundreds of thousands of tables? Very bad. Don't do
it.

Feb 16 '06 #2
Dan Stromberg:
is there a python database interface that would allow me to define a
-lot- of tables? Like, each word becomes a table, and then the fields
in that table are just the filenames that contained that word.


Give ZODB a try.
http://www.zope.org/Wikis/ZODB/FrontPage
http://www.python.org/workshops/2000...ton/zodb3.html

My first attempt would be: a BTree with the word as key, and a 'list of
filenames' as value.
http://www.zope.org/Wikis/ZODB/Front...00000000000000
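
An untested sketch of what I mean (ZODB 3 style; module paths may differ
in your version):

from ZODB import FileStorage, DB
from BTrees.OOBTree import OOBTree
import transaction

storage = FileStorage.FileStorage('index.fs')
db = DB(storage)
root = db.open().root()

if not root.has_key('index'):
    root['index'] = OOBTree()
index = root['index']

# Reassign the whole list so ZODB notices the change (plain lists are
# not persistence-aware; PersistentList would be an alternative).
filenames = index.get('word', []) + ['somefile.txt']
index['word'] = filenames

transaction.commit()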

--
René Pijlman
Feb 16 '06 #3
Jonathan Gardner wrote:
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this
database.
Stop!

I'm not a db expert either, but putting indexes everywhere is a well-known
DB antipattern. An index is only useful if the indexed field is
discriminating enough (i.e. as few records as possible should share the
same value for the field). Otherwise, the indexed lookup may end up
taking far more time than a simple linear lookup. Also, indexes slow
down write operations.
That's because you're going to lookup by any one of these
columns for the corresponding value.

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them.
Yes. And only where it makes sense.

(snip)
And your idea of hundreds of thousands of tables? Very bad. Don't do
it.


+100 on this
--
bruno desthuilliers
python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
p in 'o****@xiludom.gro'.split('@')])"
Feb 16 '06 #4
Dan Stromberg wrote:
I've been putting a little bit of time into a file indexing engine [...]
So far, I've been taking the approach of using a single-table database
like gdbm or dbhash [...] and making each entry keyed by
a word, and under the word in the database is a null terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation) .
[...] the program just isn't performing like I'd like it to.

And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.
I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.
So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.


Well, you could use simple files instead of fancy database tables.

Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).

--Bryan

import bsddb
import urllib

def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb B-Tree, a filename, and a list
    (or any iterable) of the words in the file.
    """
    fname = urllib.quote_plus(fname)
    seen = set()
    for word in word_iterator:
        if word not in seen:
            # add() records the word itself; update() would add its letters
            seen.add(word)
            key = '%s %s' % (urllib.quote_plus(word), fname)
            index[key] = ''
    index.sync()

def lookup(index, word):
    """ Pass the B-Tree (as built with add_words_from_file) and a
    word to look up. Returns list of files containing the word.
    """
    word = urllib.quote_plus(word)
    fname_list = []
    try:
        # set_location positions the cursor at the first key >= word
        current = index.set_location(word)
        while True:
            (key, _) = current
            (w, fname) = key.split()
            if w != word:
                break
            fname_list.append(urllib.unquote_plus(fname))
            current = index.next()
    except KeyError:
        pass
    return fname_list

def test():
    index = bsddb.btopen('junktest.bdb', 'n')
    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore'),
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(index, fname, words)

    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(index, word))

test()
Feb 16 '06 #5
On Thu, 16 Feb 2006 13:45:28 +0000, Bryan Olson wrote:
Dan Stromberg wrote:
I've been putting a little bit of time into a file indexing engine [...]
So far, I've been taking the approach of using a single-table database
like gdbm or dbhash [...] and making each entry keyed by
a word, and under the word in the database is a null terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation) .

[...]
the program just isn't performing like I'd like it to.
And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.


I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.


Yes, essentially, with the twist that the most recently used words are
kept cached in memory, so they don't have to be converted from string to
list and back to string every time a filename is added to a word.
So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.


Well, you could use simple files instead of fancy database tables.


That's an interesting thought. Perhaps especially if australopithecine
were saved in a filename like:

~/indices/au/st/ra/lo/pi/th/ec/in/e
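
Something like this, maybe (hypothetical sketch, two characters per
directory level):

import os

def word_to_path(word, root=os.path.expanduser('~/indices')):
    # Split the word into two-character chunks, one per directory level.
    parts = [word[i:i+2] for i in range(0, len(word), 2)]
    return os.path.join(root, *parts)

# word_to_path('australopithecine')
# -> ~/indices/au/st/ra/lo/pi/th/ec/in/e
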
Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).
Perhaps I'm missing something, but is it not roughly O(1) for
individual insertions, but O(n*m) (n == number of files, m == number of
words) for lookups?
[...]


Feb 17 '06 #6
On Thu, 16 Feb 2006 10:09:42 +0100, Rene Pijlman wrote:
Dan Stromberg:
is there a python database interface that would allow me to define a
-lot- of tables? Like, each word becomes a table, and then the fields
in that table are just the filenames that contained that word.
Give ZODB a try.
http://www.zope.org/Wikis/ZODB/FrontPage
http://www.python.org/workshops/2000...ton/zodb3.html


Aside from object persistence, what would this add for me?

Is each object saved in a separate file, or are they bunched together into
one or more files?
My first attempt would be: a BTree with the word as key, and a 'list of
filenames' as value.
http://www.zope.org/Wikis/ZODB/Front...00000000000000


This is basically what I'm doing now, except the most recently used "lists
of filenames" are kept in RAM to avoid going from "on disk string" to "in
memory list" to "on disk string" quite as much.
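
In outline, the caching does something like this (names made up for
illustration; the real code is the cachedb module mentioned in my first
post):

class WordCache:
    def __init__(self, db, max_entries=10000):
        self.db = db        # the underlying gdbm/dbhash table
        self.cache = {}     # word -> in-memory list of filename ids
        self.max_entries = max_entries

    def append(self, word, file_id):
        if word not in self.cache:
            if len(self.cache) >= self.max_entries:
                self.flush()    # crude eviction: write everything out
            if self.db.has_key(word):
                self.cache[word] = self.db[word].split('\0')
            else:
                self.cache[word] = []
        self.cache[word].append(file_id)

    def flush(self):
        # Convert each cached list back to its on-disk string form.
        for word, entries in self.cache.items():
            self.db[word] = '\0'.join(entries)
        self.cache.clear()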

Feb 17 '06 #7
On Wed, 15 Feb 2006 23:37:31 -0800, Jonathan Gardner wrote:
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID
I think I could ditch the "word ID to word" table, but I think I'll
probably eventually need the "filename ID to filename" table, so when I
pull a list of filename ID's by word I can get back the filenames.
You're going to want an index on pretty much every column in this
database. That's because you're going to lookup by any one of these
columns for the corresponding value.
I do need some pretty heavy indexing.

In fact, so far everything seems to be clicking along pretty well, except
the part that I'm having trouble indexing - the per-word list of
filenames. And that's because I already have one relation (word
number -> filenames' numbers) that's complicating getting another relation
for the list of filenames (just filename ID to '', but it'd be soooo much
faster if I could nest the relations somehow).
I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate the
data. Then you build the indexes only as you need them. Let's say you
are scanning your files. You won't need an index on the filename-to-ID
table. That's because you are just putting data in there. The word-to-ID
table needs an index on the word, but not ID (you're not looking up by
ID yet.) And the word ID-to-filename ID table doesn't need any indexes
yet either. So build up the data without the indexes. Once your scan is
complete, then build up the indexes you'll need for regular operation.
You can probably incrementally add data as you go.
Interesting thought! I think I need to ponder this a bit more.

I think the filename <-> filename number tables are pretty inexpensive,
and the word <-> word number tables seem pretty inexpensive too. I think
it's primarily the word number to filename numbers table that's messing
up the performance really bad, and that's probably because the list of
filename numbers is growing so large in some cases, and adding a filename
to a word tends to be O(n), n==the number of filenames already on the word.

May I ask: In what representation would you keep this word number to
filename numbers relation stashed away until later, to build your indexes
from more rapidly in a later pass? Are you thinking to keep them all in
memory, or store them in a flat file somehow?
As far as filename ID and word IDs go, just use a counter to generate
the next number. If you use base255 as the number, you're really not
going to save much space.
That's basically what I'm doing - incrementing a counter for each word,
except I'm converting that counter to base255. The base255 representation
of that counter brings the following benefits:

1) Converting a set of numbers to a string, without losing number
boundaries, becomes just mapping them to base255, and
string.joinfields'ing them.

2) Converting a string to a set of numbers, again without losing number
boundaries, is just string.splitfields'ing them, and mapping them from
base255 back to a list or set of numbers.

3) The numbers can be arbitrarily large, unlike with a, say, 4 or 8 byte
fixed-length record representation.
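
In outline (a simplified sketch of the idea; the real base255 module
linked above handles the details):

def to_base255(n):
    # Digits are the byte values 1..255, so '\0' never appears inside
    # a number and is free to act as the delimiter.
    digits = []
    while True:
        digits.append(chr(n % 255 + 1))
        n = n // 255
        if n == 0:
            break
    return ''.join(reversed(digits))

def from_base255(s):
    n = 0
    for ch in s:
        n = n * 255 + (ord(ch) - 1)
    return n

def pack(numbers):
    return '\0'.join([to_base255(n) for n in numbers])

def unpack(s):
    return [from_base255(field) for field in s.split('\0')]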
And your idea of hundreds of thousands of tables? Very bad. Don't do it.


Sigh. I was afraid of that. :-S

I'm thinking though... what if I were to use the current representation
for words that don't appear in that many files, and a table for each word
that has >= 1000 (or whatever threshold) filenames associated with it...?

That should greatly cut down on the number of tables, while still giving
the more efficient representation for the words that need it.

Or not?
Feb 17 '06 #8
About the filename ID - word ID table: Any good database (good with
large amounts of data) will handle the memory management for you. If
you get enough data, it may make sense to bother with PostgreSQL.
That has a pretty good record of handling very large sets of data, and
intermediate sets as well. Again, I can't speak about BDBs, but
something in the back of my head is saying that the entire table is
loaded into memory. If so, that's not good for large sets of data.

About base255: You can think of strings as numbers. At least, that's
how your computer sees them. Converting a number to base255 is really
silly. If it's a series of bytes (like a Python string is) then base255
is a string. You want to use an incremented number for the ID because
there are some indexes that deal with this kind of data very, very
well. If you have your number spread out like a shotgun, with clusters
here and there, the index might not be very good.

About using many tables: The best answer you are going to get is a
bunch of files---one for each word---with the names or IDs of the
files. You can store these in a single directory, or (as you'll find
out to be more efficient) a tree of directories. But this kind of data
is useless for reverse lookups. If you really do have so much data,
start using a real database that is designed to handle this kind of
data efficiently. The overhead of SQL parsing and network latency soon
gets dwarfed by lookup times anyway.

Feb 17 '06 #9
About indexes everywhere: Yes, you don't have to be a DB expert to know
that indexes everywhere is bad. But look at this example. There are
really two ways that the data is going to get accessed in regular use.
Either they are going to ask for all files that have a word (most
likely) or they are going to see which words are in a file.

I'm going to have to name the tables now, aren't I? Here's a simple
schema:

words
--------
word_id
word

files
------
file_id
filename

word_files
--------------
file_id
word_id
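
For concreteness, in code (sqlite3 purely as a stand-in here; any SQL
database would do), with only the one index you keep during population,
as discussed below:

import sqlite3

conn = sqlite3.connect('index.db')
conn.executescript("""
    CREATE TABLE words      (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE files      (file_id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE word_files (word_id INTEGER, file_id INTEGER);

    -- the only index needed while populating; build the rest afterward
    CREATE INDEX words_word ON words (word);
""")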

If you are going to look up by word, you'll need an index on words.word.
You'll also need an index on word_files.word_id. And then you'll need
an index on files.file_id.

If you are going to look up by file, you'll need an index on
files.filename, word_files.file_id, and words.word_id.

So it ends up in this situation you need indexes everywhere.

Now, when you are doing the initial population, you should drop all the
indexes you don't need during population. That means everything but
words.word has to go. (You'll need to find the word_id for previously
seen words.) After the initial population, then is the time to build
and add the indexes. It's much faster to build an index when you have
the entire set of data in front of you than to do it piece-by-piece.
Some indexes actually get built better than they would've piecemeal.

Unfortunately this is no longer strictly topical to Python. But if you
free your mind from thinking in terms of SQL databases and look at
indexes as dicts or whatnot, then you can see that this is really a
general programming problem.

Feb 17 '06 #10
