Support for new items in set type

I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to save only the
new items. Currently I am creating a copy of the set as soon as I load
it, and when I go back to save it, I'm calculating the difference
and saving just the difference.
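In rough outline, what I'm doing now looks like this (just a sketch -
load_table and append_rows are placeholder names for my actual file I/O):

    original = set(load_table(tablespace))   # snapshot taken at load time
    working = set(original)                  # the copy the program mutates

    # ... the program adds/removes uuid strings in `working` ...

    new_items = working - original           # the difference to be saved
    append_rows(tablespace, new_items)       # only the additions get written
    # note: if items were only deleted, new_items comes out empty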

There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable - O(n), I presume
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).

I was thinking of writing a specialized set implementation (I have no
idea how to start on this) which tracks new items (added after
initialization) in a separate space and exposes a new API call which
would enable me to directly get those values. This is kind of ugly and
doesn't solve problem 3.

I also thought of using a hashtable, but I'm storing many (around 1
million) of these sets in the same file (most of them will be empty or
contain just a few values, but a few of them will be very large - in
excess of 10,000 items). The file is already separated into tablespaces.

Does anyone have any ideas?

Thanks in advance.
Prateek
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.

Apr 22 '07 #1
Prateek <su*****@gmail.com> wrote:
I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to only save the
new items. Currently I am creating a copy of the set as soon as I load
it and then when I go back to save it, i'm calculating the difference
and saving just the difference.

There are many problems with this approach so far:
1) Calculating the difference via the standard set implementation is
not very scalable -O(n) I presume
Yep, you do have to look at every item, so O(N) is an obvious lower
bound.
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).
(3) is easy -- the difference originalset-finalset is the set of things
you have to delete.
Does anyone have any ideas?
Your problem is really a thinly disguised version of one of the most
ancient and venerable ones in data processing -- the "master file /
transaction file" problem. With sets (which are built as hash tables)
you have very powerful weapons for this fray (it used to be, many years
ago, that sorting and "in-step processing of sorted files" was the only
feasible approach to such problems), but you still have to wield them
properly.

In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them

You can implement a set-like interface; for example, your .add method
will need to remove the item from the deleted set (if it was there),
then add it to the added set (if it wasn't already in the master set);
and so on, for each and every method you actually desire (including no
doubt __contains__ for membership testing). E.g., with suitably obvious
names for the three sets, we could have...:

class cleverset(object):
    ...
    def __contains__(self, item):
        if item in self.deleted: return False
        elif item in self.added: return True
        else: return item in self.master

When the time comes to save back to disk, you'll perform deletions and
additions based on self.deleted and self.added, of course.
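A possible sketch of the mutating methods and of that final flush, under
the same naming (write_rows and delete_rows here are hypothetical
placeholders for whatever the on-disk format supports):

    def add(self, item):
        self.deleted.discard(item)           # un-delete if previously removed
        if item not in self.master:
            self.added.add(item)

    def remove(self, item):
        if item in self.added:
            self.added.remove(item)          # was new, never hit the disk
        elif item in self.master and item not in self.deleted:
            self.deleted.add(item)           # mark an on-disk item for deletion
        else:
            raise KeyError(item)

    def flush(self, tablespace):
        # write_rows / delete_rows stand in for the real disk operations
        write_rows(tablespace, self.added)
        delete_rows(tablespace, self.deleted)
        self.master = (self.master - self.deleted) | self.added
        self.added.clear()
        self.deleted.clear()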

I'm not addressing the issue of how to best represent such sets on disk
because it's not obvious to me what operations you need to perform on
the on-disk representation -- for example, deletions can be very costly
unless you can afford to simply set a "logically deleted" flag on
deleted items; but, depending on how often you delete stuff, that will
eventually degrade performance due to fragmentation, and maintaining the
indexing (if you need one) can be a bear.

Normally, I would use a good relational database (which has no doubt
already solved on my behalf all of these issues) for the purpose of
persistent storage of such data -- usually sqlite for reasonably small
amounts of data, PostgreSQL if I need enormous scalability and other
"enterprise" features, but I hear other people are happy with many other
engines for relational DBs, such as Oracle (used to be superb if you
could afford full-time specialized db admins), IBM's DB2, SAP's database
(now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
good experience with the latter, but I guess maybe it's only me, as
otherwise it could hardly be so popular). The main point here is that a
relational DB's tables _should_ already be very well optimized for
storing "sets", since those tables ARE exactly nothing but sets of tuples
(and a 1-item tuple is a tuple for all that:-).
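For illustration, a minimal sqlite3 sketch of that idea (the table and
column names are made up, purely illustrative):

    import sqlite3

    conn = sqlite3.connect('sets.db')
    conn.execute('CREATE TABLE IF NOT EXISTS members (uuid TEXT PRIMARY KEY)')

    def add_items(uuids):
        conn.executemany('INSERT OR IGNORE INTO members VALUES (?)',
                         ((u,) for u in uuids))
        conn.commit()

    def delete_items(uuids):
        conn.executemany('DELETE FROM members WHERE uuid = ?',
                         ((u,) for u in uuids))
        conn.commit()

    def load_set():
        return set(row[0] for row in conn.execute('SELECT uuid FROM members'))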
Alex
Apr 22 '07 #2
On Sat, 21 Apr 2007 20:13:44 -0700, Prateek wrote:
I have a bit of a specialized request.

I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.

When I go back to save this set, I'd like to be able to only save the
new items.
This may be a silly question, but why? Why not just save the modified set,
new items and old, and not mess about with complicated transactions?

After all, you say:
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.
Since disk space is not important, I think that you shouldn't care that
you're duplicating the original items. (Although maybe I'm missing
something.)

Perhaps what you should be thinking about is writing a custom pickle-like
module optimized for reading/writing sets quickly.
--
Steven.

Apr 22 '07 #3
On Apr 22, 11:09 am, Steven D'Aprano
<s...@REMOVE.THIS.cybersource.com.au> wrote:
On Sat, 21 Apr 2007 20:13:44 -0700, Prateek wrote:
I have a bit of a specialized request.
I'm reading a table of strings (specifically fixed length 36 char
uuids generated via uuid.uuid4() in the standard library) from a file
and creating a set out of it.
Then my program is free to make whatever modifications to this set.
When I go back to save this set, I'd like to be able to only save the
new items.

This may be a silly question, but why? Why not just save the modified set,
new items and old, and not mess about with complicated transactions?
I tried just that. I basically ignored all the difficulties of
difference calculation and just overwrote the entire tablespace with
the new set. At about 3000 entries per file (and 3 files), along with
all the indexing etc., just the extra I/O cost me 28% in performance:
I got 3000 entries committed in 53s with difference calculation but in
68s when writing the whole thing.
After all, you say:
PS: Yes - I need blazing fast performance - simply pickling/unpickling
won't do. Memory constraints are important but definitely secondary.
Disk space constraints are not very important.

Since disk space is not important, I think that you shouldn't care that
you're duplicating the original items. (Although maybe I'm missing
something.)

Perhaps what you should be thinking about is writing a custom pickle-like
module optimized for reading/writing sets quickly.
I already did this. I'm not using the pickle module at all. Since I'm
guaranteed that my sets contain a variable number of fixed-length
strings, I write a header at the start of each tablespace (using
struct.pack) giving the number of rows, and then simply save each
string one after the other without delimiters. I can do this simply by
issuing "".join(list(set_in_question)) and then saving the string
after the header. There are a few more things that I handle (such as
automatic tablespace overflow).
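Roughly, the layout looks like this sketch (simplified - the real header
has more fields than just the row count):

    from struct import pack, unpack, calcsize

    ROW_LEN = 36                      # fixed-length uuid4 strings
    HEADER = '<I'                     # simplified header: just the row count
    HEADER_LEN = calcsize(HEADER)

    def save_set(f, s):
        f.write(pack(HEADER, len(s)))
        f.write(''.join(s))           # fixed-width rows, no delimiters

    def load_set(f):
        (nrows,) = unpack(HEADER, f.read(HEADER_LEN))
        data = f.read(nrows * ROW_LEN)
        return set(data[i * ROW_LEN:(i + 1) * ROW_LEN] for i in range(nrows))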

Prateek

Apr 22 '07 #4
2) Maintaining a copy wastes memory
3) I don't have a good solution if I delete items from the set
(calculating the difference will return an empty set but I need to
actually delete stuff).

(3) is easy -- the difference originalset-finalset is the set of things
you have to delete.
Now why didn't I think of that?
Does anyone have any ideas?

Your problem is really a thinly disguised version of one of the most
ancient and venerable ones in data processing -- the "master file /
transaction file" problem. With sets (which are built as hash tables)
you have very powerful weapons for this fray (it used to be, many years
ago, that sorting and "in-step processing of sorted files" was the only
feasible approach to such problems), but you still have to wield them
properly.

In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them

You can implement a set-like interface; for example, your .add method
will need to remove the item from the deleted set (if it was there),
then add it to the added set (if it wasn't already in the master set);
and so on, for each and every method you actually desire (including no
doubt __contains__ for membership testing). E.g., with suitably obvious
names for the three sets, we could have...:

class cleverset(object):
    ...
    def __contains__(self, item):
        if item in self.deleted: return False
        elif item in self.added: return True
        else: return item in self.master

When the time comes to save back to disk, you'll perform deletions and
additions based on self.deleted and self.added, of course.
Ok. This class sounds like a good idea. I'll try it.
I'm not addressing the issue of how to best represent such sets on disk
because it's not obvious to me what operations you need to perform on
the on-disk representation -- for example, deletions can be very costly
unless you can afford to simply set a "logically deleted" flag on
deleted items; but, depending on how often you delete stuff, that will
eventually degrade performance due to fragmentation, and maintaining the
indexing (if you need one) can be a bear.
I actually don't have very many transactions I need to perform on
disk. Once I've loaded these sets, I normally use the contains
operation and union/intersection/difference operations.
I can definitely use a tombstone on the record to mark it as logically
deleted. The problem right now is that I don't need indexing within
sets (I either load the whole set or don't bother at all) - I'm only
indexing the entire file (one entry in the index per tablespace). So
even if I use tombstones, I'll still have to go through the whole
tablespace to find the tombstoned entries. Alternatively, I could write
the deleted data at the end of the tablespace and mark it with the
tombstone (since I don't care about space so much) and, when I load the
data, eliminate the tombstoned entries. I'll give that a try and
report my performance results in this thread - I absolutely love that
I can make such an important change in one or two days using Python!
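In rough outline, the append-a-tombstone idea would look something like
this sketch (the one-byte flag in front of each row is just an assumption
for illustration, not my actual on-disk format):

    LIVE, DEAD = '\x00', '\x01'       # assumed 1-byte flag prefixed to each row

    def append_rows(f, uuids, flag=LIVE):
        f.seek(0, 2)                              # jump to end of tablespace
        f.write(''.join(flag + u for u in uuids))

    def load_live_set(f, nrows):
        live = set()
        for _ in range(nrows):
            rec = f.read(1 + 36)
            if rec[0] == LIVE:
                live.add(rec[1:])
            else:
                live.discard(rec[1:])             # a later tombstone wins
        return live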
Normally, I would use a good relational database (which has no doubt
already solved on my behalf all of these issues) for purpose of
persistent storage of such data -- usually sqlite for reasonably small
amounts of data, PostgreSQL if I need enormous scalability and other
"enterprise" features, but I hear other people are happy with many other
engines for relational DBs, such as Oracle (used to be superb if you
could afford full-time specialized db admins), IBM's DB2, SAP's database
(now opensourced), Firebird, even MySQL (I've never, but NEVER, had any
good experience with the latter, but I guess maybe it's only me, as
otherwise it could hardly be so popular). The main point here is that a
relational DB's tables _should_ be already very well optimized for
storing "sets", since those table ARE exactly nothing but sets of tuples
(and a 1-item tuple is a tuple for all that:-).
Thanks Alex, but we're actually implementing a (non-relational)
database engine. This particular file is one of many files (which are
all kept in sync by the engine) that I need to persist (some files use
other methods - including pickling - for persistence). I've already
solved most of our other problems and according to hotshot, this is
one of the next hurdles (this commit is responsible for 8% of the
total CPU time @ 3000 items and 3 files).
If you're interested, you can check it out at http://www.brainwavelive.com
or email me.

Prateek

Apr 22 '07 #5
In article <11**********************@e65g2000hsc.googlegroups.com>,
Prateek <su*****@gmail.com> wrote:
Thanks Alex, but we're actually implementing a (non-relational)
database engine.
Why are you reinventing the wheel? Why not just implement your
functionality on top of an existing database as your backing store?
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

Help a hearing-impaired person: http://rule6.info/hearing.html
Apr 22 '07 #6
Oh dear god, I implemented this and overall it killed performance by
about 50%-100%. The same script (entering 3000 items) now takes between
88s and 109s (it was running in 55s earlier).

Here is the new Set implementation:
class SeaSet(set):
    __slots__ = ['master', 'added', 'deleted']
    def __init__(self, s=None):
        if s is None: s = []
        self.master = set(s)
        self.added = set()
        self.deleted = set()

    def add(self, l):
        if l not in self.master:
            self.added.add(l)
            try:
                self.deleted.remove(l)
            except KeyError:
                pass

    def remove(self, l):
        try:
            self.master.remove(l)
            self.deleted.add(l)
        except KeyError:
            try:
                self.added.remove(l)
            except KeyError:
                pass

    def __contains__(self, l):
        if l in self.deleted:
            return False
        elif l in self.added:
            return True
        else:
            return l in self.master

    def __len__(self):
        return len(self.master) + len(self.added)

    def __iter__(self):
        for x in self.master:
            yield x
        for x in self.added:
            yield x

    def __or__(self, s):
        return self.union(s)

    def __and__(self, s):
        return self.intersection(s)

    def __sub__(self, s):
        return self.difference(s)

    def union(self, s):
        """docstring for union"""
        if isinstance(s, (set, frozenset)):
            return s | self.master | self.added
        elif isinstance(s, SeaSet):
            return self.master | self.added | s.master | s.added
        else:
            raise TypeError

    def intersection(self, s):
        """docstring for intersection"""
        if isinstance(s, (set, frozenset)):
            return s & self.master & self.added
        elif isinstance(s, SeaSet):
            return self.master & self.added & s.master & s.added
        else:
            raise TypeError

    def difference(self, s):
        """docstring for difference"""
        if isinstance(s, (set, frozenset)):
            self.deleted |= (self.master - s)
            self.master -= s
            self.added -= s
        elif isinstance(s, SeaSet):
            t = (s.master | s.deleted)
            self.deleted |= self.master - t
            self.master -= t
            self.added -= t
        else:
            raise TypeError
The surprising thing is that commits *ARE* running about 50% faster
(according to the time column in the hotshot profiler). But now the
longest-running operations seem to be the I/O operations, which are
taking 10 times longer (even if they're only reading or writing a few
bytes). Could this have something to do with the set implementation
being in Python as opposed to C?

For instance, this method:
    def __readTableHeader(self, f):
        hdr = f.read(sz__TABLE_HEADER_FORMAT__)
        if len(hdr) < sz__TABLE_HEADER_FORMAT__:
            raise EOFError
        t = THF_U(hdr)
        #t = unpack(__TABLE_HEADER_FORMAT__, hdr)
        return t

is now taking 13s when it was taking less than 0.8s before! (same
number of calls, nothing changed except the set implementation)

sz__TABLE_HEADER_FORMAT__ is a constant = struct.calcsize("<II37s") = 45
THF_U is a reference to the unpack method of the relevant struct.Struct
object
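For reference, the constant and the bound method work out like this
(interpreter session just to confirm the arithmetic):

    >>> import struct
    >>> struct.calcsize("<II37s")   # 4 + 4 + 37 bytes, little-endian, no padding
    45
    >>> THF_U = struct.Struct("<II37s").unpack   # bound unpack, format parsed once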

What the heck happened?!

Prateek

Apr 23 '07 #7
On Mon, 23 Apr 2007 02:17:49 -0300, Prateek <su*****@gmail.com> wrote:
Oh dear god, I implemented this and it overall killed performance by
about 50% - 100%. The same script (entering 3000 items) takes between
88 - 109s (it was running in 55s earlier).

Here is the new Set implementation:
class SeaSet(set):
[...]
The surprising thing is that commits *ARE* running about 50% faster
(according to the time column in the hotshot profiler). But, now, the
longest running operations seem to be the I/O operations which are
taking 10 times longer! (even if they're only reading or writing a few
bytes. Could this have something to do with the set implementation
being in Python as opposed to C?
Hard to tell - you have posted only your SeaSet implementation, and no I/O
is involved in that code.
For instance, this method:
    def __readTableHeader(self, f):
        hdr = f.read(sz__TABLE_HEADER_FORMAT__)
        if len(hdr) < sz__TABLE_HEADER_FORMAT__:
            raise EOFError
        t = THF_U(hdr)
        #t = unpack(__TABLE_HEADER_FORMAT__, hdr)
        return t

is now taking 13s when it was taking less than 0.8s before! (same
number of calls, nothing changed except the set implementation)
I don't see where your SeaSet class is used.

--
Gabriel Genellina
Apr 23 '07 #8
I don't see where your SeaSet class is used.
Actually that is the point. According to the hotshot profile, the
problem code doesn't use the SeaSet implementation. Yet that same code
was running much faster earlier. I tried multiple times (2-3 times).
From what I can fathom, nothing else changed - just the set
implementation.

It seems obvious that the read() call in the __readTableHeader method
is blocking for longer periods, although it isn't exactly obvious why
that might be.
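One sanity check I can run (on the guess that hotshot's own overhead is
inflating the per-call I/O numbers) is to time the same work outside the
profiler, e.g.:

    import timeit

    # time the unpack step alone, with no profiler attached;
    # the 45-byte dummy buffer stands in for a real header record
    t = timeit.Timer("THF_U(hdr)",
                     "from struct import Struct; "
                     "THF_U = Struct('<II37s').unpack; hdr = 'x' * 45")
    print t.timeit(100000)

    # a similar standalone timing of f.read(45) on the real file would show
    # whether the read itself, or the profiling, accounts for the 13s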

I was hoping someone with an idea of Python-C interaction could shed
some light on this. I'm on a different computer right now, I'll log
back in later and post more code if that helps.

Again, thanks to anyone who can help.

Prateek

Apr 23 '07 #9
[Alex Martelli]
In your shoes, I would write a class whose instances hold three sets:
-- the "master set" is what you originally read from the file
-- the "added set" is the set of things you've added since then
-- the "deleted set" is the set of things you've deleted since them
FWIW, I've posted a trial implementation in the Python cookbook:
http://aspn.activestate.com/ASPN/Coo.../Recipe/511496
Raymond Hettinger

Apr 27 '07 #10
