By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
440,235 Members | 1,011 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 440,235 IT Pros & Developers. It's quick & easy.

How can I create customized classes that have similar properties as'str'?

P: n/a
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.

Thank you.
Nov 24 '07 #1
Share this Question
Share on Google+
25 Replies


P: n/a
I find myself frequently in need of classes like this for two reasons.
First, it's efficient in memory. Second, when two instances are
compared for equality only their pointers are compared. (I think
that's how Python compares 'str's.

On Nov 24, 6:31 pm, Licheng Fang <fanglich...@gmail.comwrote:
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.

Thank you.
Nov 24 '07 #2

P: n/a
Licheng Fang wrote:
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.
If you only want that if "a == b" is True also "a is b" is True,
overload the is_ attribute of your class. Personally, I don't see
any advantage in this.

Regards,
Björn

--
BOFH excuse #352:

The cables are not the same length.

Nov 24 '07 #3

P: n/a
Licheng Fang wrote:
I find myself frequently in need of classes like this for two
reasons. First, it's efficient in memory.
Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.

BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.
Second, when two instances are compared for equality only their
pointers are compared.
I state that the object management will often eat more performance
than equality testing. Except you have a huge number of equal
objects. If the latter was the case you should rethink your program
design.
(I think that's how Python compares 'str's.
Generally not. In CPython, just very short strings are created only
once.
>>a=" "
b=" "
a is b
True
>>a=" "
b=" "
a is b
False

Regards,
Björn

--
BOFH excuse #430:

Mouse has out-of-cheese-error

Nov 24 '07 #4

P: n/a
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
mail-0306.20.chr0n...@spamgourmet.comwrote:
Licheng Fang wrote:
I find myself frequently in need of classes like this for two
reasons. First, it's efficient in memory.

Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.
Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem
is, D1 and D2 have almost the same set of keys (trigrams of the text),
yet the keys in D2 are new instances, even though these keys probably
have already been inserted into D1. So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.
>
BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.
I probably need this class to be immutable.
Second, when two instances are compared for equality only their
pointers are compared.

I state that the object management will often eat more performance
than equality testing. Except you have a huge number of equal
objects. If the latter was the case you should rethink your program
design.
Yeah, counting is all about equal or not.
(I think that's how Python compares 'str's.

Generally not. In CPython, just very short strings are created only
once.
>a=" "
b=" "
a is b
True
>a=" "
b=" "
a is b
Wow, I didn't know this. But exactly how Python manage these strings?
My interpretator gave me such results:
>>a = 'this'
b = 'this'
a is b
True
>>a = 'this is confusing'
b = 'this is confusing'
a is b
False

False

Regards,

Björn

--
BOFH excuse #430:

Mouse has out-of-cheese-error
Nov 24 '07 #5

P: n/a
Licheng Fang wrote:
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
>BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.

I probably need this class to be immutable.
IMHO you don't need a change of Python, but simply a special
implementation (probably using metaclasses/singletons).
Wow, I didn't know this. But exactly how Python manage these
strings?
I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

Regards,
Björn

--
BOFH excuse #165:

Backbone Scoliosis

Nov 24 '07 #6

P: n/a
On Nov 24, 5:44 am, Licheng Fang <fanglich...@gmail.comwrote:
Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. [...] So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.
create a hash that maps your keys to themselves, then use the values
of that hash as your keys.
>>store = {}
def atom(str):
global store
if str not in store:
store[str] = str
return store[str]
>>a='this is confusing'
b='this is confusing'
a == b
True
>>a is b
False
>>atom(a) is atom(b)
True
Nov 24 '07 #7

P: n/a
On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:
Licheng Fang wrote:
>On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
>Wow, I didn't know this. But exactly how Python manage these
strings?

I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?
AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch
Nov 24 '07 #8

P: n/a
On Nov 24, 9:42 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.netwrote:
On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:
Licheng Fang wrote:
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
Wow, I didn't know this. But exactly how Python manage these
strings?
I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch
Thanks. Then, is there a way to make python treat all strings this
way, or any other kind of immutable objects?
Nov 24 '07 #9

P: n/a
Licheng Fang a écrit :
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.
Here's a Q&D attempt - without any garantee, and to be taylored to your
needs.

_values = {} #id(instance) =value mapping
_instances = {} #hash(value) =instance mapping

class Value(object):
def __new__(cls, value):
try:
return _instances[hash(value)]
except KeyError:
instance = object.__new__(cls)
_values[id(instance)] = value
_instances[hash(value)] = instance
return instance

@apply
def value():
def fget(self):
return _values[id(self)]
def fset(self, ignore):
raise AttributeError("%s.value is read only" % type(self))
def fdel(self):
raise AttributeError("%s.value is read only" % type(self))
return property(**locals())
HTH
Nov 24 '07 #10

P: n/a
On Nov 24, 10:35 am, Licheng Fang <fanglich...@gmail.comwrote:
Thanks. Then, is there a way to make python treat all strings this
way, or any other kind of immutable objects?
The word generally used is 'atom' when referring to strings that are
set up such that 'a == b' implies 'a is b'. This is usually an
expensive process, since you don't want to do it to strings that are,
e.g., read from a file. Yes, it could be done only for string
constants, and some languages (starting with LISP) do this, but that
isn't what you (or most people) want. Whether you realize it or not,
you want control over the process; in your example, you don't want to
do it for the lines read from your file, just the trigrams.

The example that I gave does exactly that. It adds a fixed amount of
storage for each string that you 'intern' (the usual name given to the
process of generating such a string. Let's look at my code again:
>>store = {}
def atom(str):
global store
if str not in store:
store[str] = str
return store[str]

Each string passed to 'atom' already exists. We look to see if copy
already exists; if so we can discard the latest instance and use that
copy henceforth. If a copy does not exist, we save the string inside
'store'. Since the string already exists, we're just increasing its
reference count by two (so it won't be reference counted) and
increasing the size of 'store' by (an amortized) pair of pointers to
that same string.
Nov 24 '07 #11

P: n/a
On Nov 24, 5:44 am, Licheng Fang <fanglich...@gmail.comwrote:
Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
BTW, if the components of your trigrams are never larger than a byte,
then encode the tuples as integers and don't worry about pointer
comparisons.
>>def encode(s):
return (ord(s[0])*256+ord(s[1]))*256+ord(s[2])
>>def trigram(s):
return [ encode(s[i:i+3]) for i in range(0, len(s)-2)]
>>trigram('abcde')
[6382179, 6447972, 6513765]
Nov 24 '07 #12

P: n/a
On Sat, 24 Nov 2007 04:54:43 -0800, samwyse wrote:
create a hash that maps your keys to themselves, then use the values of
that hash as your keys.
>>>store = {}
def atom(str):
global store
if str not in store:
store[str] = str
return store[str]
Oh lordy, that's really made my day! That's the funniest piece of code
I've seen for a long time! Worthy of being submitted to the DailyWTF.

Samwyse, while I applaud your willingness to help, I think you actually
need to get some programming skills before doing so. Here's a hint to get
you started: can you think of a way to optimize that function so it does
less work?

--
Steven.
Nov 24 '07 #13

P: n/a
On Nov 24, 4:59 pm, Steven D'Aprano

<st...@REMOVE-THIS-cybersource.com.auwrote:
On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
mail-0306.20.chr0n...@spamgourmet.comwrote:
Licheng Fang wrote:
I find myself frequently in need of classes like this for two
reasons. First, it's efficient in memory.
Are you using millions of objects, or MB size objects? Otherwise, this
is no argument.
Yes, millions.

Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????
Comments like these make one wonder if your real life experience with
massive data matches even the one tenth of your self-importance and
need to be snarky in most of your posts.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen of recipes in
the Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Coo.../Recipe/413717.

HTH,
George
Nov 24 '07 #14

P: n/a
Steven D'Aprano wrote:
No advantage? That's for sure. There is no is_ attribute of
generic classes, and even if there was, it would have no special
meaning.
Argl, I confused the operator module's attributes with objects ;)

Regards,
Björn

--
BOFH excuse #378:

Operators killed by year 2000 bug bite.

Nov 24 '07 #15

P: n/a
On Sat, 24 Nov 2007 14:58:50 -0800, George Sakkis wrote:
On Nov 24, 4:59 pm, Steven D'Aprano

<st...@REMOVE-THIS-cybersource.com.auwrote:
>On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
mail-0306.20.chr0n...@spamgourmet.comwrote:
Licheng Fang wrote:
I find myself frequently in need of classes like this for two
reasons. First, it's efficient in memory.
>Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.
Yes, millions.

Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM
cope?????

Comments like these make one wonder if your real life experience with
massive data matches even the one tenth of your self-importance and need
to be snarky in most of your posts.
I cheerfully admit to never needing to deal with "massive data".

However, I have often needed to deal with tens and hundreds of megabytes
of data, which IS NOT MASSIVE amounts of data to deal with on modern
systems. Which was my point.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen of recipes in the
Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Coo.../Recipe/413717.
This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.

--
Steven.
Nov 25 '07 #16

P: n/a
samwyse <sa*****@gmail.comwrites:
create a hash that maps your keys to themselves, then use the values
of that hash as your keys.
The "atom" function you describe already exists under the name
"intern".
Nov 25 '07 #17

P: n/a
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
samwyse <sa*****@gmail.comwrites:
>create a hash that maps your keys to themselves, then use the values of
that hash as your keys.

The "atom" function you describe already exists under the name "intern".
Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.
In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.

From the docs:
http://docs.python.org/lib/non-essen...-in-funcs.html

intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. Interning strings is useful to
gain a little performance on dictionary lookup - if the keys in a
dictionary are interned, and the lookup key is interned, the key
comparisons (after hashing) can be done by a pointer compare instead of a
string compare. Normally, the names used in Python programs are
automatically interned, and the dictionaries used to hold module, class
or instance attributes have interned keys. Changed in version 2.3:
Interned strings are not immortal (like they used to be in Python 2.2 and
before); you must keep a reference to the return value of intern() around
to benefit from it.
Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see
whether he actually has a memory problem or not, right?
--
Steven.
Nov 25 '07 #18

P: n/a
Steven D'Aprano <st***@REMOVE-THIS-cybersource.com.auwrites:
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>samwyse <sa*****@gmail.comwrites:
>>create a hash that maps your keys to themselves, then use the values of
that hash as your keys.

The "atom" function you describe already exists under the name "intern".

Not really. intern() works very differently, because it can tie itself to
the Python internals.
The exact implementation mechanism is subtly different, but
functionally intern is equivalent to the "atom" function.
In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary
problem. According to the docs, intern()'s purpose is to speed up
dictionary lookups, not to save memory. I suspect that if it does
save memory, it will be by accident.
It's not by accident, it follows from what interning does. Interning
speeds up comparisons by returning the same string object for the same
string contents. If the strings you're working with tend to repeat,
interning will save some memory simply by preventing storage of
multiple copies of the same string. Whether the savings would make
any difference for the OP is another question.
From the docs:
http://docs.python.org/lib/non-essen...-in-funcs.html

intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. [...]

Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.
That's a frequently misunderstood sentence. It doesn't mean that
intern will make copies; it simply means that the string you get back
from intern can be either the string you passed it or another
(previously interned) string object that is guaranteed to have the
same contents as your string (which makes it technically a "copy" of
the string you passed to intern).
Nov 25 '07 #19

P: n/a
Steven D'Aprano wrote:
>>>>store = {}
def atom(str):
global store
if str not in store:
store[str] = str
return store[str]

Oh lordy, that's really made my day! That's the funniest piece of code
I've seen for a long time! Worthy of being submitted to the DailyWTF.
Here's a script to show atom()'s effect on memory footprint:

$ cat atom.py
import sys
data = [1]*1000
items = []
cache = {}
if "-a" in sys.argv:
def atom(obj):
try:
return cache[obj]
except KeyError:
cache[obj] = obj
return obj
else:
def atom(obj):
return obj
try:
while 1:
items.append(atom(tuple(data)))
except MemoryError:
print len(items)
$ ulimit -v 5000
$ python atom.py
226
$ python atom.py -a
185742

So if you are going to submit Sam's function make sure to bundle it with
this little demo...

Peter
Nov 25 '07 #20

P: n/a
On Sun, 25 Nov 2007 10:39:38 +0100, Peter Otten wrote:
So if you are going to submit Sam's function make sure to bundle it with
this little demo...
Well Peter, I was going to reply with a comment about not changing the
problem domain (tuples of ints to trigrams from a text file for natural
language processing, that is, three character alphanumeric strings), and
that if you re-did your test with strings (as I did) you would see
absolutely no difference. What I was going to say was "Tuples aren't
interned. Short strings that look like identifiers are. Jumping through
hoops to cache things which are already cached is not productive
programming."

But then I dug a little deeper, and disassembled the code I was running,
and discovered that I was being fooled by the Python compiler's constant-
folding, and if I took steps to defeat the optimizer, the effect I was
seeing disappeared, and I got the same results as you.

Well. So I've learned something new: Python doesn't intern strings in the
way I thought it did. I don't quite know *how* it decides which strings
to intern and which ones not to, but at least I've learnt that what I
thought was true is not true.

So I offer my apology to Samwyse, the caching code isn't as redundant and
silly as it appears, and humbly tuck into this nice steaming plate of
crow.

Somebody pass the salt please.

--
Steven.
Nov 25 '07 #21

P: n/a
On Nov 24, 6:42 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.auwrote:
This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.
Not to beat a dead horse, but memoization can significantly minimize
memory usage, given a large data set with redundant elements (as the
OP seems to suggest [e.g., calculating the deltas of trigrams in a
natural language]). For example, if the data set has 1/3 redundant
elements, then the un-memoized version requires 1/3 more memory,
because it needs to store 1/3 more unique copies of the element,
whereas the memoized version has only to store unique elements once.
Every instance of an element which is already in the cache requires
only the cache lookup (speed), rather than the creation of a new
object (memory). So the trade-off is actually speed for memory rather
than the other way around. Of course, it all depends on the nature of
the data; but for large, redundant data sets, memoization is
definitely a win when it comes to memory optimization.

Regards,
Jordan
Nov 25 '07 #22

P: n/a
Steven D'Aprano wrote:
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>samwyse <sa*****@gmail.comwrites:
>>create a hash that maps your keys to themselves, then use the values of
that hash as your keys.
The "atom" function you describe already exists under the name "intern".

Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.
In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.
>>From the docs:
http://docs.python.org/lib/non-essen...-in-funcs.html
Yes, but it seems that buffer remains
with Python 3000.

Colin W.
>
intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. Interning strings is useful to
gain a little performance on dictionary lookup - if the keys in a
dictionary are interned, and the lookup key is interned, the key
comparisons (after hashing) can be done by a pointer compare instead of a
string compare. Normally, the names used in Python programs are
automatically interned, and the dictionaries used to hold module, class
or instance attributes have interned keys. Changed in version 2.3:
Interned strings are not immortal (like they used to be in Python 2.2 and
before); you must keep a reference to the return value of intern() around
to benefit from it.
Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see
whether he actually has a memory problem or not, right?

Nov 25 '07 #23

P: n/a
Steven D'Aprano wrote:
On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>samwyse <sa*****@gmail.comwrites:
>>create a hash that maps your keys to themselves, then use the values of
that hash as your keys.
The "atom" function you describe already exists under the name "intern".

Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.
In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.
>>From the docs:
http://docs.python.org/lib/non-essen...-in-funcs.html
Yes, but it seems that buffer remains
with Python 3000.

Colin W.
>
intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. Interning strings is useful to
gain a little performance on dictionary lookup - if the keys in a
dictionary are interned, and the lookup key is interned, the key
comparisons (after hashing) can be done by a pointer compare instead of a
string compare. Normally, the names used in Python programs are
automatically interned, and the dictionaries used to hold module, class
or instance attributes have interned keys. Changed in version 2.3:
Interned strings are not immortal (like they used to be in Python 2.2 and
before); you must keep a reference to the return value of intern() around
to benefit from it.
Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see
whether he actually has a memory problem or not, right?

Nov 25 '07 #24

P: n/a
On Nov 24, 4:44 am, Licheng Fang <fanglich...@gmail.comwrote:
>
Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:
I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory. Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled. You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id to
the three unigrams. Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects. If you do this multiple times, you get multiple
references to the same integer object for the trigram id.

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure. In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign each to each
unigram an integer, and assuming that you have less than two million
unigrams you can pack three unigrams into a 64-bit integer... Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees. (At the time, PyJudy didn't exist.)

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn
Dec 3 '07 #25

P: n/a
On Nov 24, 6:38 pm, Hrvoje Niksic <hnik...@xemacs.orgwrote:
samwyse <samw...@gmail.comwrites:
create a hash that maps your keys to themselves, then use the values
of that hash as your keys.

The "atom" function you describe already exists under the name
"intern".
D'oh! That's what I get for not memorizing "Non-essential Built-in
Functions".

In my defense, however, my function will work with anything that can
be used as a dictionary key (strings, tuples, frozen sets, etc), not
just character strings; thus we return to the original:
>>a=(1,2,3)
b=(1,1+1,1+1+1)
a == b
True
>>a is b
False
>>atom(a) is atom(b)
True
>>intern(a) is intern(b)
Traceback (most recent call last):
File "<pyshell#33>", line 1, in ?
intern(a) is intern(b)
TypeError: intern() argument 1 must be string, not tuple
Dec 6 '07 #26

This discussion thread is closed

Replies have been disabled for this discussion.