personal document mgmt system idea

Hi folks,

I have been mulling over an idea for a very simple python-based
personal document management system. The source of this possible
solution is the following typical problem:

I accumulate a lot of files (documents, archives, pdfs, images, etc.)
on a daily basis and storing them in a hierarchical file system is
simple but unsatisfactory:

- deeply nested hierarchies are a pain to navigate
  and to reorganize
- different file systems have inconsistent and weak schemes
  for storing metadata, e.g. compare the variety of incompatible
  schemes in Windows alone (Office docs vs. PDFs etc.).

I would like a personal document management system that:

- is of adequate and usable performance
- can accommodate data files of up to 50MB
- is simple and easy to use
- promotes maximum programmability
- allows for the selective replication (or backup) of data
  over a network
- allows for multiple (custom) classification schemes
- is portable across operating systems

The system should promote the following simple pattern:

receive file -> drop it into 'special' folder

after an arbitrary period of doing the above n times -> run
application

for each file in folder:
    if automatic metadata extraction is possible:
        scan file for metadata and populate fields accordingly
        fill in missing metadata
    else:
        enter metadata
    store file

every now and then:
    run replicator function of application -> will backup data
    over a network
# this will make specified files available to co-workers
# accessing a much larger web-based non-personal version of the
# docmanagement system.
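
The per-file loop above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the poster's code: `extract_metadata`, `REQUIRED_FIELDS`, and the `ask`/`store` callbacks are hypothetical names, the "extraction" is a stand-in that only looks at the file name, and the sketch targets Python 3 rather than the 2.3 used elsewhere in this thread.

```python
import os

# Hypothetical minimum set of fields every record must carry.
REQUIRED_FIELDS = ("title", "subject")


def extract_metadata(path):
    """Stand-in extractor: returns a dict, or None when nothing can be extracted."""
    name, ext = os.path.splitext(os.path.basename(path))
    if not ext:
        return None  # treat extension-less files as "extraction not possible"
    return {"title": name, "format": ext.lstrip(".").lower()}


def ingest(inbox, ask, store):
    """Process every file in `inbox`.

    ask(field, path) returns a value entered by the user;
    store(path, metadata) persists the file and its record.
    """
    for name in sorted(os.listdir(inbox)):
        path = os.path.join(inbox, name)
        if not os.path.isfile(path):
            continue
        # Automatic extraction where possible, otherwise start from nothing.
        metadata = extract_metadata(path) or {}
        # Fill in whatever is still missing by asking the user.
        for field in REQUIRED_FIELDS:
            if field not in metadata:
                metadata[field] = ask(field, path)
        store(path, metadata)
```

Passing `ask` and `store` in as callbacks keeps the loop independent of how metadata is entered (console, GUI) and of where files end up (database, file system).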

My initial prototyping efforts involved creating a single test table
in MySQL (later to include fields for Dublin Core metadata elements)
and a BLOB field for the data itself. My present dev platform is
Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v.0.9.2
and Python 2.3.3. However, I will be testing the same app on Mac OS X
and Linux Mandrake 9.2 as well.

The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case an .avi file of 7.2 MB.

Here's the code:

<code>

import sys, time, os, zlib
import MySQLdb, _mysql

def initDB(db='test'):
    connection = MySQLdb.Connect("localhost", "sa")
    cursor = connection.cursor()
    cursor.execute("use %s;" % db)
    return (connection, cursor)

def close(connection, cursor):
    connection.close()
    cursor.close()

def drop_table(cursor):
    try:
        cursor.execute("drop table tstable")
    except:
        pass

def create_table(cursor):
    cursor.execute('''create table tstable
        ( id INTEGER PRIMARY KEY AUTO_INCREMENT,
          name VARCHAR(100),
          data BLOB
        );''')

def process(data):
    data = zlib.compress(data, 9)
    return _mysql.escape_string(data)

def populate_table(cursor):
    files = [(f, os.path.join('testdocs', f)) for f in
             os.listdir('testdocs')]
    for filename, filepath in files:
        t1 = time.time()
        data = open(filepath, 'rb').read()
        data = process(data)
        # IMPORTANT: you have to quote the binary txt even after escaping it.
        cursor.execute('''insert into tstable (id, name, data)
            values (NULL, '%s', '%s')''' % (filename, data))
        print time.time() - t1, 'seconds for ', filepath

def main ():
    connection, cursor = initDB()
    # doit
    drop_table(cursor)
    create_table(cursor)
    populate_table(cursor)
    close(connection, cursor)

if __name__ == "__main__":
    t1 = time.time()
    main ()
    print '=> it took total ', time.time() - t1, 'seconds to complete'

</code>

<traceback>
pythonw -u "test_blob.py"

0.155999898911 seconds for testdocs\business plan.doc
0.0160000324249 seconds for testdocs\concept2businessprocess.pdf
0.0160000324249 seconds for testdocs\diagram.vsd
0.0149998664856 seconds for testdocs\logo.jpg
Traceback (most recent call last):
  File "test_blob.py", line 59, in ?
    main ()
  File "test_blob.py", line 53, in main
    populate_table(cursor)
  File "test_blob.py", line 44, in populate_table
    cursor.execute('''insert into tstable (id, name, data) values
(NULL, '%s', '%s')''' % (filename, data))
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 95, in execute
    return self._execute(query, args)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
Exit code: 1


</traceback>

My Questions are:

- Is my test code at fault?

- Is this the wrong approach to begin with: i.e. is it a bad idea to
store the data itself in the database?

- Am I using the wrong database? (or is the connector just buggy?)

Thanks to all.

best regards,

Sandy Norton
Jul 18 '05 #1
sa********@hotmail.com (Sandy Norton) writes:
> I have been mulling over an idea for a very simple python-based
> personal document management system. The source of this possible
> solution is the following typical problem:
>
> I accumulate a lot of files (documents, archives, pdfs, images, etc.)
> on a daily basis and storing them in a hierarchical file system is
> simple but unsatisfactory:
>
> - deeply nested hierarchies are a pain to navigate
>   and to reorganize
> - different file systems have inconsistent and weak schemes
>   for storing metadata e.g. compare variety of incompatible
>   schemes in windows alone (office docs vs. pdfs etc.).
>
> I would like a personal document management system that: [...]
> The system should promote the following simple pattern: [...]

Pybliographer 2 is aiming at these features (but a lot more besides).
Work has been slow for a long while, but several new releases of
pyblio 1 have come out recently, and work is taking place on pyblio 2.
There are design documents on the web at pybliographer.org. Why not
muck in and implement what you want with Pyblio?

> [...] My initial prototyping efforts involved creating a single test table
> in mysql (later to include fields for dublin-core metadata elements)
> and a BLOB field for the data itself. My present dev platform is
> windows XP pro, mysql 4.1.1-alpha, MySQL-python connector v.0.9.2
> and python 2.3.3. However, I will be testing the same app on Mac OS X
> and Linux Mandrake 9.2 as well.

ATM Pyblio only runs on GNOME, but that's going to change.

> The first problem I've run into is that mysql or the MySQL
> connector crashes when the size of one BLOB reaches a certain point:
> in this case an .avi file of 7.2 MB.
>
> Here's the code: [...]
> _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
> Exit code: 1
>
> </traceback>
>
> My Questions are:
>
> - Is my test code at fault?
>
> - Is this the wrong approach to begin with: i.e. is it a bad idea to
>   store the data itself in the database?

Haven't read your code, but the error certainly strongly suggests a
MySQL configuration problem.

John
Jul 18 '05 #2
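
For context on the configuration point: two MySQL size limits are worth checking for a 7.2 MB insert. Error 2006 is commonly reported when a single statement is larger than the server's `max_allowed_packet` setting (readable with `SHOW VARIABLES LIKE 'max_allowed_packet'`), and a plain `BLOB` column holds at most 65,535 bytes. Whether either limit is what actually caused the failure in this thread is an assumption. The sketch below is Python 3 with hypothetical helper names (`smallest_blob_type`, `fits_in_packet`) and only does the arithmetic; it does not talk to a server.

```python
# Maximum payload sizes of MySQL's BLOB column types, in bytes.
BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),
    ("BLOB", 2**16 - 1),
    ("MEDIUMBLOB", 2**24 - 1),
    ("LONGBLOB", 2**32 - 1),
]


def smallest_blob_type(size):
    """Return the smallest BLOB column type that can hold `size` bytes."""
    for name, limit in BLOB_LIMITS:
        if size <= limit:
            return name
    raise ValueError("no BLOB type holds %d bytes" % size)


def fits_in_packet(statement_bytes, max_allowed_packet):
    """True if a statement of this size stays within the server's packet limit."""
    return statement_bytes <= max_allowed_packet


if __name__ == "__main__":
    avi_size = int(7.2 * 1024 * 1024)
    print(smallest_blob_type(avi_size))            # column type needed for the .avi
    print(fits_in_packet(avi_size, 1024 * 1024))   # against a 1 MB packet limit
```

By this arithmetic the 50MB target from the requirements list would need a `LONGBLOB` column and a packet limit above 50 MB, since escaping and the surrounding SQL add to the statement size.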
I wouldn't put the individual files in a database - that's what
file systems are for. The exception is small files (and by the
time you say ".doc" in MS Word, it's no longer a small
file) where you can save substantial space by consolidating
them.

John Roth

Jul 18 '05 #3
Sandy Norton wrote:

Hi Sandy,

looks like this will be the year of personal document management projects.
Since I'm involved in a similar project (hope I can go Open Source with it),
here are some of my thoughts.

> Hi folks,
>
> I have been mulling over an idea for a very simple python-based
> personal document management system. The source of this possible
> solution is the following typical problem:
>
> [...]
>
> The first problem I've run into is that mysql or the MySQL
> connector crashes when the size of one BLOB reaches a certain point:
> in this case an .avi file of 7.2 MB.


Just dump your files somewhere in the filesystem and keep a record of it in
your database.

In addition, a real (text) search engine might be of help. I'm using swish-e
(www.swish-e.org) and am very pleased with it.

Maybe, before you invest too much time into such a project, you should check
out the following:

Chandler (http://www.osafoundation.org)
if it's finished, it will do exactly what you are aiming for (and it's
written in Python)

ReiserFS (see www.namesys.com -> Future Vision)

Gnome Storage (http://www.gnome.org/~seth/storage)

WinFS
(http://msdn.microsoft.com/Longhorn/u...S/default.aspx)

Hope that helps

Stephan

Jul 18 '05 #4
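
The "files on disk, record in the database" suggestion might look roughly like this. It is a minimal sketch under assumptions: it uses Python 3 and the standard-library `sqlite3` module instead of the MySQL setup discussed in this thread, and the table layout and the names `open_catalog` and `add_document` are hypothetical.

```python
import os
import shutil
import sqlite3


def open_catalog(db_path):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT NOT NULL,"
        " path TEXT NOT NULL UNIQUE,"
        " size INTEGER NOT NULL)"
    )
    return conn


def add_document(conn, source, store_dir):
    """Copy `source` into `store_dir` and record where it went; returns the row id."""
    os.makedirs(store_dir, exist_ok=True)
    name = os.path.basename(source)
    target = os.path.join(store_dir, name)
    if os.path.exists(target):
        raise FileExistsError(target)
    shutil.copy2(source, target)
    # Placeholders let the driver do the quoting; only the path is stored,
    # so the size of the file never touches the database connection.
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO documents (name, path, size) VALUES (?, ?, ?)",
            (name, target, os.path.getsize(target)),
        )
    return cur.lastrowid
```

Because the row holds only a name, a path, and a size, the insert stays tiny no matter how large the document is.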
John J. Lee:
> Pybliographer 2 is aiming at these features (but a lot more besides).
> Work has been slow for a long while, but several new releases of
> pyblio 1 have come out recently, and work is taking place on pyblio 2.
> There are design documents on the web at pybliographer.org. Why not
> muck in and implement what you want with Pyblio?


Thanks for the reference, Pyblio definitely seems interesting and I
will be looking into this project closely.

cheers.

Sandy
Jul 18 '05 #5
John Roth wrote:
> I wouldn't put the individual files in a database - that's what
> file systems are for. The exception is small files (and by the
> time you say ".doc" in MS Word, it's no longer a small
> file) where you can save substantial space by consolidating
> them.


There seems to be consensus that I shouldn't store files in the
database. This makes sense as filesystems seem to be optimized for,
um, files (-;

As I want to get away from deeply nested directories, I'm going to
test two approaches:

1. store everything in a single folder and hash each file name to give
a unique id

2. create a directory structure based upon a calendar year and store
the daily downloads automatically.

I can finally use some code I'd written before for something like this
purpose:

<code>

from pprint import pprint
import os
import calendar

class Calendirs:

    months = {
        1 : 'January',
        2 : 'February',
        3 : 'March',
        4 : 'April',
        5 : 'May',
        6 : 'June',
        7 : 'July',
        8 : 'August',
        9 : 'September',
        10 : 'October',
        11 : 'November',
        12 : 'December'
    }

    wkdays = {
        0 : 'Monday',
        1 : 'Tuesday',
        2 : 'Wednesday',
        3 : 'Thursday',
        4 : 'Friday',
        5 : 'Saturday',
        6 : 'Sunday'
    }

    def __init__(self, year):
        self.year = year

    def calendir(self):
        '''returns list of calendar matrices'''
        mc = calendar.monthcalendar
        cal = [(self.year, m) for m in range(1,13)]
        return [mc(y,m) for (y, m) in cal]

    def yearList(self):
        res=[]
        weekday = calendar.weekday
        m = 0
        for month in self.calendir():
            lst = []
            m += 1
            for week in month:
                for day in week:
                    if day:
                        day_str = Calendirs.wkdays[weekday(self.year,
                                                           m, day)]
                        lst.append( (str(m)+'.'+Calendirs.months[m],
                                     str(day)+'.'+day_str) )
            res.append(lst)
        return res

    def make(self):
        for month in self.yearList():
            for m, day in month:
                path = os.path.join(str(self.year), m, day)
                os.makedirs(path)

Calendirs(2004).make()

</code>
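
For comparison, the same `year/M.Month/D.Weekday` layout can be generated more compactly with `datetime`. This is a sketch assuming Python 3; the function names `calendar_paths` and `make_calendar_dirs` and the `root` parameter are hypothetical, and the month and weekday names come from `strftime`, so they are English only under the default "C" locale.

```python
import os
from datetime import date, timedelta


def calendar_paths(year):
    """Yield one 'year/M.Month/D.Weekday' relative path per day of `year`."""
    day = date(year, 1, 1)
    while day.year == year:
        yield os.path.join(
            str(year),
            "%d.%s" % (day.month, day.strftime("%B")),
            "%d.%s" % (day.day, day.strftime("%A")),
        )
        day += timedelta(days=1)


def make_calendar_dirs(root, year):
    """Create the directory tree for `year` below `root`."""
    for path in calendar_paths(year):
        # exist_ok lets the function be re-run without failing on existing dirs.
        os.makedirs(os.path.join(root, path), exist_ok=True)
```

Creating only today's directory at download time, rather than the whole year up front, would be another option with the same naming scheme.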
I don't know which method will perform better or be more usable...
testing testing testing.

regards,

Sandy
Jul 18 '05 #6
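
Approach 1 above (a single folder, with a hash as the unique id) could be sketched as follows. This assumes Python 3 and `hashlib`; the helper names are hypothetical. `name_id` hashes the file name as the post describes, while `content_id` is a swapped-in variation that hashes the file's bytes instead, so that two different files with the same name do not collide.

```python
import hashlib
import os
import shutil


def name_id(filename):
    """Hex id derived from the file name alone, as in approach 1."""
    return hashlib.sha1(filename.encode("utf-8")).hexdigest()


def content_id(path, chunk_size=1 << 20):
    """Hex id derived from the file's bytes; identical files share an id."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so a 50MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_flat(source, store_dir, by_content=False):
    """Copy `source` into the single folder `store_dir` under its id."""
    os.makedirs(store_dir, exist_ok=True)
    if by_content:
        ident = content_id(source)
    else:
        ident = name_id(os.path.basename(source))
    ext = os.path.splitext(source)[1]  # keep the extension for convenience
    target = os.path.join(store_dir, ident + ext)
    shutil.copy2(source, target)
    return target
```

Either way the original file name has to be kept somewhere else, for example in the database record, since it cannot be recovered from the hash.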
Stephan Diehl wrote:

> [...]
> Just dump your files somewhere in the filesystem and keep a record of it in
> your database.

I think I will go with this approach. (see other posting for details)

> In addition, a real (text) search engine might be of help. I'm using swish-e
> (www.swish-e.org) and am very pleased with it.

Just downloaded it... looks good. Now if it also had a python api (-;

> Maybe, before you invest too much time into such a project, you should check
> out the following:
>
> Chandler (http://www.osafoundation.org)
> if it's finished, it will do exactly what you are aiming for (and
> it's written in Python)

Still early stages... I see they dropped the ZODB.

> ReiserFS (see www.namesys.com -> Future Vision)
> Gnome Storage (http://www.gnome.org/~seth/storage)
> WinFS
> (http://msdn.microsoft.com/Longhorn/u...S/default.aspx)

Wow! Very exciting stuff... I guess we'll just have to wait and see what develops.

> Hope that helps

Yes. Very informative. Cheers for the help.

> Stephan

Sandy
Jul 18 '05 #7
Have you looked at the modules available from divmod.org for your text
searching?

Jul 18 '05 #8
Sandy Norton wrote:
> Stephan Diehl wrote:
>
>> [...]
>> In addition, a real (text) search engine might be of help. I'm using
>> swish-e (www.swish-e.org) and am very pleased with it.
>
> Just downloaded it... looks good. Now if it also had a python api (-;

I'm just using the command line interface via os.system and the popenX
calls.
The only thing that is (unfortunately) not possible is to remove a document
from the index :-(
If you need any help, just drop me a line.
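
Driving a command-line search tool from Python, as described above, might look like this in current Python, where `subprocess` has taken over from `os.system` and the `popenX` calls. The wrapper is a generic sketch with a hypothetical name (`run_search`); the swish-e arguments in the comment are an assumption about its command line and should be checked against the swish-e documentation.

```python
import subprocess
import sys


def run_search(command):
    """Run a search command (a list of arguments) and return its non-empty output lines."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Assumed invocation, not verified here:
    #   run_search(["swish-e", "-w", "python", "-f", "index.swish-e"])
    # Stand-in command so the sketch runs without swish-e installed:
    print(run_search([sys.executable, "-c", "print('hit 1'); print('hit 2')"]))
```

Passing the command as a list avoids shell quoting problems with search terms, and `check=True` turns a non-zero exit status into an exception instead of a silently empty result.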

>> Maybe, before you invest too much time into such a project, you should
>> check out the following:
>>
>> Chandler (http://www.osafoundation.org)
>> if it's finished, it will do exactly what you are aiming for
>> (and it's written in Python)
>
> Still early stages... I see they dropped the ZODB.

Did they? If they succeed, Chandler will rock. My personal opinion is that
they are trying to do too much at once. I guess that a better filesystem will make
most of the document management type applications obsolete.
The big problem, of course, is to define 'better' in a meaningful way.

>> ReiserFS (see www.namesys.com -> Future Vision)
>> Gnome Storage (http://www.gnome.org/~seth/storage)
>> WinFS
>> (http://msdn.microsoft.com/Longhorn/u...S/default.aspx)
>
> Wow! Very exciting stuff... I guess we'll just have to wait and see what
> develops.

Or go the other way: build a new filesystem prototype application in python,
see if it works out as intended, and then build a proper file system.

>> Hope that helps
>
> Yes. Very informative. Cheers for the help.
>
>> Stephan
>
> Sandy


Jul 18 '05 #9
