
interning strings


The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.

I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.

For example, using py2.3.3, I find that string interning does seem to
happen sometimes ...
>>> s1 = "the"
>>> s2 = "the"
>>> s1 is s2
True

And it even happens in this case ...
>>> s = "aa"
>>> s1 = s[:1]
>>> s2 = s[-1:]
>>> s1, s2
('a', 'a')
>>> s1 is s2
True

But not in what appears an almost identical case ...
>>> s = "the the"
>>> s1 = s[:3]
>>> s2 = s[-3:]
>>> s1, s2
('the', 'the')
>>> s1 is s2
False

BUT, oddly, it does seem to happen here ...
>>> class X:
...     pass
...
>>> x = X()
>>> y = "the"
>>> x.the = 42
>>> x.__dict__
{'the': 42}
>>> y is x.__dict__.keys()[0]
True

Are there any language rules regarding when strings are interned and
when they are not? Should I be ignoring the apparent poor status of
'intern' and using it anyway? At worst, are there any CPython 'accidents
of implementation' that I can take advantage of?

Why do I need this? Well, I have to read in a very large XML document
and convert it into objects, and within the document many attributes
have common string values. To reduce the memory footprint, I'd like to
intern these commonly referenced strings, AND I'm wondering how much work
I need to do, and how much will happen automatically.

Any insights appreciated.

BTW, I'm aware that I can do string interning myself using a dict cache
(which is what ElementTree does internally). But, this whole subject
has got me curious now, and I'd like to understand a bit better. Would,
for example, using the builtin 'intern' give a better result than my
hand coded interning?
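
For readers following along on Python 3, the hand-coded dict cache mentioned above can be sketched in a few lines. The class name `InternPool` is made up for this illustration; it is not ElementTree's code, only a minimal version of the idea.

```python
class InternPool:
    """Hand-rolled interning: map each distinct value to one canonical object."""

    def __init__(self):
        self._pool = {}

    def intern(self, value):
        # setdefault stores `value` on first sight and otherwise returns
        # the equal object that was stored earlier.
        return self._pool.setdefault(value, value)

    def __len__(self):
        return len(self._pool)


pool = InternPool()
# Build the strings at runtime so the compiler cannot share them.
first = pool.intern("".join(["the", " ", "the"]))
second = pool.intern("".join(["the ", "the"]))
print(first is second, len(pool))
```

Unlike the builtin, a pool like this works for any hashable value, and dropping the pool object releases its references to the cached strings.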

--
Mike

Jul 18 '05 #1
Mike Thompson <none.by.e-mail> wrote:
> The interning of strings has me puzzled. It seems to happen sometimes,
> but not others. I can't discern the pattern and I can't seem to find
> documentation regarding it.


Strings of length < 2 are always interned:
>>> a = ""
>>> a is ""
True
>>> a = " "
>>> a is " "
True
>>> "aname"[1] is "n"
True

String constants that are potential attribute names are also interned:
>>> a = "could_be_a_name"
>>> a is "could_be_a_name"
True
>>> a = "could not be a name"
>>> a is "could not be a name"
False

...although the algorithm to determine whether a string constant could be a
name is simplistic (see all_name_chars() in compile.c, or believe that it
does what its name suggests):
>>> a = "1x"
>>> a is "1x"
True

Strings that are otherwise created are not interned:
>>> a = "aname"
>>> a is "a" + "name"
False

By the way, why would you want to mess with these implementation details?
Use the == operator to compare strings and be happy ever after :-)
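
A sketch for re-running this kind of experiment on a current interpreter (Python 3 assumed; `compare` is a hypothetical helper name). The second string is built at runtime because a newer compiler may fold `"a" + "name"` into a single literal, which could change the outcome of the last test above. Only the `==` result is defined by the language; the `is` result is whatever the implementation happens to do.

```python
def compare(left, right):
    """Return (equal, identical) for two strings."""
    return left == right, left is right


literal = "aname"
built = "".join(["a", "name"])   # created at runtime, not by the compiler

equal, identical = compare(literal, built)
print("equal:", equal)           # value comparison, defined by the language
print("identical:", identical)   # implementation detail, do not rely on it
```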

Peter

Jul 18 '05 #2
Peter Otten wrote:
> String constants that are potential attribute names are also interned:


Peter has explained all this correctly, but this aspect needs some
stressing perhaps: string *literals* that are potential attribute
names are also interned. This interning is done in the compiler,
when the code object is created, so strings not created by the compiler
are not interned.

[all strings are "constant", i.e. immutable, so the statement
above might have been confusing]

Regards,
Martin
Jul 18 '05 #3

[snip very useful explanation]

> By the way, why would you want to mess with these implementation details?
> Use the == operator to compare strings and be happy ever after :-)


'==' won't help me, I'm afraid.

I need to improve the speed and memory footprint of an application which
reads in a very large XML document.

Some elements in the incoming documents can be filtered out, so I've
written my own SAX handler to extract just what I want. All the same,
the content being read in is substantial.

So, to further reduce memory footprint, my SAX handler tries to manually
intern (using dicts of strings) a lot of the duplicated content and
attributes coming from the XML documents. Also, I use the SAX feature
'feature_string_interning' to hopefully intern the strings used for
attribute names etc.
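
A rough Python 3 sketch of the setup described above, not the actual handler: the element name `item`, the class name and the sample document are invented for the example. It filters elements in a SAX handler, interns attribute names and values through a dict, and asks the parser for `feature_string_interning`, ignoring the request if the parser does not support it.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_string_interning


class InterningHandler(ContentHandler):
    """Keep only <item> elements and share equal attribute strings."""

    def __init__(self):
        super().__init__()
        self._cache = {}     # manual interning table
        self.records = []

    def _intern(self, text):
        return self._cache.setdefault(text, text)

    def startElement(self, name, attrs):
        if name != "item":   # hypothetical filter: skip everything else
            return
        self.records.append(
            {self._intern(k): self._intern(v) for k, v in attrs.items()}
        )


def parse(xml_text):
    handler = InterningHandler()
    parser = xml.sax.make_parser()
    try:
        # Ask the parser to intern names itself, if it supports the feature.
        parser.setFeature(feature_string_interning, True)
    except (xml.sax.SAXNotRecognizedException, xml.sax.SAXNotSupportedException):
        pass
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.records


records = parse(
    '<root><item color="red" size="L"/><skip color="red"/>'
    '<item color="red" size="M"/></root>'
)
print(len(records), records[0]["color"] is records[1]["color"])
```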

Which is all working fine, except that now, as a final process, I'd like
to understand interning a bit more.

From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache. That may provide an advantage, even if it
doesn't work with unicode.

--
Mike
Jul 18 '05 #4

A while ago, we faced a similar issue, trying to reduce total memory
usage and runtime of one of our Python applications which parses very
large log files (100+ MB).

One particular class is instantiated many times and changing just that
class to use __slots__ helped quite a bit. More details are here

<http://mail.python.org/pipermail/python-list/2004-May/220513.html>
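
A minimal sketch of the `__slots__` change, assuming a record class with a fixed set of fields (the class and field names are invented for the example). Declaring `__slots__` removes the per-instance `__dict__`, which is where the saving comes from when a class is instantiated many times.

```python
class LogRecord:
    # Hypothetical record type; the field names are invented for the example.
    __slots__ = ("timestamp", "level", "message")

    def __init__(self, timestamp, level, message):
        self.timestamp = timestamp
        self.level = level
        self.message = message


record = LogRecord("2004-05-01T12:00:00", "INFO", "started")
has_dict = hasattr(record, "__dict__")   # slotted instances carry no per-instance dict

try:
    record.extra = 1                     # not declared in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(has_dict, rejected)
```

The trade-off is that attributes not listed in `__slots__` can no longer be added to an instance.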

/Jean Brouwers
ProphICy Semiconductor, Inc.

Jul 18 '05 #5
[Mike Thompson]
...
> From your explanation there seem to be no language rules, just
> implementation accidents. And none of those will be particularly
> helpful in my case.
String interning is purely an optimization. Python added the concept
to speed its own name lookups, and the rules it uses for
auto-interning are effective for that. It wasn't necessary to expose
the interning facilities to users to meet its goal, and, especially
since interned strings were originally immortal, it would have been a
horrible idea to intern all strings. The machinery was exposed just
because it's Pythonic to expose internals when reasonably possible.
There wasn't, and shouldn't be, an expectation that exposed internals
will be perfectly suited as-is to arbitrary applications.
> However, I still think I'm going to try using the builtin 'intern' rather than my
> own dict cache.
That's fine -- that's why it got exposed. Don't assume that any
string is interned unless you explicitly intern() it, and you'll be
happy (and it doesn't hurt to intern() a string that's already
interned -- you just get back a reference to the already-interned copy
then).
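
In Python 3 the builtin moved to `sys.intern`, and the same rule applies. A small sketch (the variable names are illustrative) of explicitly interning two equal strings and then interning one of them a second time:

```python
import sys

# Two equal strings built at runtime, so nothing has interned them implicitly.
a = "".join(["attr", "-", "value"])
b = "".join(["attr-", "value"])

canon_a = sys.intern(a)
canon_b = sys.intern(b)          # returns the copy that was interned first
again = sys.intern(canon_a)      # interning an interned string is harmless

print(canon_a is canon_b, again is canon_a)
```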

[earlier]
> I can find documentation of a builtin called 'intern' but its use seems
> frowned upon these days.


Not by me, but it's never been useful to *most* apps, apart from the
indirect benefits they get from Python's internal uses of string
interning. It's rare that an app really wants some strings stored
uniquely, and possibly never that an app wants all strings stored
uniquely. Most apps that use explicit string interning appear to be
looking for no more than a partial workalike for Lisp symbols.
Jul 18 '05 #6
Mike Thompson <none.by.e-mail> wrote:
> '==' won't help me, I'm afraid.
>
> I need to improve the speed and memory footprint of an application which
> reads in a very large XML document.

Yes, I should have read your post carefully. But I was preoccupied with
speed...

> From your explanation there seem to be no language rules, just
> implementation accidents. And none of those will be particularly
> helpful in my case.
With arbitrary strings the likelihood of a cache hit decreases fast. Using
your own dictionary and checking the refcounts could give you interesting
insights. Unfortunately there is no WeakDictionary with both keys and
values as weakrefs, so you have to do some work, or you will actually
_increase_ memory footprint.
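
One way to do that work, sketched for Python 3 with a made-up class name `CountingCache`: instead of inspecting refcounts, this version swaps in a simpler technique and counts how often each cached string is requested again, so entries that never produced a hit can be pruned. It is an illustration of the trade-off, not a tuned implementation.

```python
class CountingCache:
    """Dict-based interning that records how often each entry was reused."""

    def __init__(self):
        self._entries = {}            # string -> [canonical string, hit count]

    def intern(self, text):
        entry = self._entries.get(text)
        if entry is None:
            self._entries[text] = [text, 0]
            return text
        entry[1] += 1
        return entry[0]

    def prune(self, min_hits=1):
        """Drop entries reused fewer than min_hits times; return how many."""
        stale = [key for key, (_, hits) in self._entries.items() if hits < min_hits]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def __len__(self):
        return len(self._entries)


cache = CountingCache()
red = cache.intern("".join(["r", "ed"]))
red_again = cache.intern("".join(["re", "d"]))    # hit: same object comes back
cache.intern("".join(["one", "-", "off"]))         # never seen again
dropped = cache.prune()
print(red_again is red, dropped, len(cache))
```

Calling `prune()` periodically keeps one-off strings from accumulating in the cache, at the cost of re-adding any that show up again later.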
> However, I still think I'm going to try using the builtin 'intern'
> rather than my own dict cache. That may provide an advantage, even if it
> doesn't work with unicode.


You might at least choose an alias

my_intern = intern

then, lest you later regret that limitation.
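
The alias idea carries over to Python 3, where the builtin lives at `sys.intern` and accepts `str` (which is Unicode there). A sketch of a wrapper, assuming you also want to share other hashable values such as `bytes`; the dict fallback is an addition for this example, not something the builtin provides.

```python
import sys

_fallback = {}   # used for values sys.intern does not accept


def my_intern(value):
    """Single entry point, so the interning strategy can be changed later."""
    if type(value) is str:
        return sys.intern(value)
    # Assumption for this sketch: any other hashable value goes in a dict.
    return _fallback.setdefault(value, value)


s1 = my_intern("".join(["fo", "o-bar"]))
s2 = my_intern("".join(["foo-", "bar"]))
b1 = my_intern(bytes([120, 121]))
b2 = my_intern(bytes([120, 121]))
print(s1 is s2, b1 is b2)
```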

Peter

Jul 18 '05 #7
"Martin v. Löwis" wrote:
> Peter Otten wrote:
>> String constants that are potential attribute names are also interned:
>
> Peter has explained all this correctly, but this aspect needs some
> stressing perhaps: string *literals* that are potential attribute
> names are also interned. This interning is done in the compiler,
> when the code object is created, so strings not created by the compiler
> are not interned.
>
> [all strings are "constant", i.e. immutable, so the statement
> above might have been confusing]


Yes, string "literal", not "constant" is the appropriate term for what I
meant.
For completeness here is an example demonstrating that names appearing as
"bare words" in the code are interned:
>>> class X:
...     def __getattr__(self, name):
...         return name
...
>>> a = X().this_is_an_attribute
>>> X().this_is_an_attribute is a
True

Peter
Jul 18 '05 #8
