interning strings


The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.

I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.

For example, using py2.3.3, I find that string interning does seem to
happen sometimes ...
>>> s1 = "the"
>>> s2 = "the"
>>> s1 is s2
True

And it even happens in this case ...
>>> s = "aa"
>>> s1 = s[:1]
>>> s2 = s[-1:]
>>> s1, s2
('a', 'a')
>>> s1 is s2
True

But not in what appears an almost identical case ...
>>> s = "the the"
>>> s1 = s[:3]
>>> s2 = s[-3:]
>>> s1, s2
('the', 'the')
>>> s1 is s2
False

BUT, oddly, it does seem to happen here ...
>>> class X:
...     pass
...
>>> x = X()
>>> y = "the"
>>> x.the = 42
>>> x.__dict__
{'the': 42}
>>> y is x.__dict__.keys()[0]
True

Are there any language rules regarding when strings are interned and
when they are not? Should I be ignoring the apparent poor status of
'intern' and using it anyway? At worst, are there any CPython 'accidents
of implementation' that I can take advantage of?

Why do I need this? Well, I have to read in a very large XML document
and convert it into objects, and within the document many attributes
have common string values. To reduce the memory footprint, I'd like to
intern these commonly referenced strings AND I'm wondering how much work
I need to do, and how much will happen automatically.

Any insights appreciated.

BTW, I'm aware that I can do string interning myself using a dict cache
(which is what ElementTree does internally). But, this whole subject
has got me curious now, and I'd like to understand a bit better. Would,
for example, using the builtin 'intern' give a better result than my
hand coded interning?

--
Mike

Jul 18 '05 #1
Mike Thompson <none.by.e-mail> wrote:
The interning of strings has me puzzled. It seems to happen sometimes,
but not others. I can't discern the pattern and I can't seem to find
documentation regarding it.


Strings of length < 2 are always interned:
>>> a = ""
>>> a is ""
True
>>> a = " "
>>> a is " "
True
>>> "aname"[1] is "n"
True

String constants that are potential attribute names are also interned:
>>> a = "could_be_a_name"
>>> a is "could_be_a_name"
True
>>> a = "could not be a name"
>>> a is "could not be a name"
False

...although the algorithm to determine whether a string constant could be a
name is simplistic (see all_name_chars() in compile.c, or believe that it
does what its name suggests):
>>> a = "1x"
>>> a is "1x"
True

Strings that are otherwise created are not interned:
>>> a = "aname"
>>> a is "a" + "name"
False

By the way, why would you want to mess with these implementation details?
Use the == operator to compare strings and be happy ever after :-)

Peter

Jul 18 '05 #2
Peter Otten wrote:
String constants that are potential attribute names are also interned:


Peter has explained all this correctly, but this aspect needs some
stressing perhaps: string *literals* that are potential attribute
names are also interned. This interning is done in the compiler,
when the code object is created, so strings not created by the compiler
are not interned.

[all strings are "constant", i.e. immutable, so the statement
above might have been confusing]

Regards,
Martin
Jul 18 '05 #3

[snip very useful explanation]

By the way, why would you want to mess with these implementation details?
Use the == operator to compare strings and be happy ever after :-)


'==' won't help me, I'm afraid.

I need to improve the speed and memory footprint of an application which
reads in a very large XML document.

Some elements in the incoming documents can be filtered out, so I've
written my own SAX handler to extract just what I want. All the same,
the content being read in is substantial.

So, to further reduce memory footprint, my SAX handler tries to manually
intern (using dicts of strings) a lot of the duplicated content and
attributes coming from the XML documents. Also, I use the SAX feature
'feature_string_interning' to hopefully intern the strings used for
attribute names etc.

Which is all working fine, except that now, as a final process, I'd like
to understand interning a bit more.

From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache. That may provide an advantage, even if it
doesn't work with unicode.

--
Mike
Jul 18 '05 #4

A while ago, we faced a similar issue, trying to reduce total memory
usage and runtime of one of our Python applications which parses very
large log files (100+ MB).

One particular class is instantiated many times and changing just that
class to use __slots__ helped quite a bit. More details are here

<http://mail.python.org/pipermail/python-list/2004-May/220513.html>
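
A minimal sketch of the `__slots__` change described above. The class name `Record` and its fields are made up for illustration and are not taken from the linked post.

```python
class Record(object):
    """Hypothetical log record; the field names are illustrative only."""
    # Declaring __slots__ replaces the per-instance __dict__ with
    # fixed storage for exactly these attributes.
    __slots__ = ("timestamp", "level", "message")

    def __init__(self, timestamp, level, message):
        self.timestamp = timestamp
        self.level = level
        self.message = message

r = Record("2004-05-01", "INFO", "started")
print(hasattr(r, "__dict__"))   # False: no per-instance dict is allocated
```

The trade-off is that instances can no longer grow arbitrary attributes: assigning a name not listed in `__slots__` raises AttributeError.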

/Jean Brouwers
ProphICy Semiconductor, Inc.


Jul 18 '05 #5
[Mike Thompson]
...
From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

String interning is purely an optimization. Python added the concept
to speed its own name lookups, and the rules it uses for
auto-interning are effective for that. It wasn't necessary to expose
the interning facilities to users to meet its goal, and, especially
since interned strings were originally immortal, it would have been a
horrible idea to intern all strings. The machinery was exposed just
because it's Pythonic to expose internals when reasonably possible.
There wasn't, and shouldn't be, an expectation that exposed internals
will be perfectly suited as-is to arbitrary applications.

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache.

That's fine -- that's why it got exposed. Don't assume that any
string is interned unless you explicitly intern() it, and you'll be
happy (and it doesn't hurt to intern() a string that's already
interned -- you just get back a reference to the already-interned copy
then).
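
The advice above can be sketched as follows. This assumes Python 3, where the function is `sys.intern`; on the Python 2.3 discussed in this thread it is the builtin `intern`. The variable names are illustrative.

```python
import sys  # Python 3: intern() lives in sys; in Python 2 it is a builtin

# Build two equal strings at run time; do not assume either is interned.
raw_a = " ".join(["could", "not", "be", "a", "name"])
raw_b = " ".join(["could", "not", "be", "a", "name"])

a = sys.intern(raw_a)
b = sys.intern(raw_b)
print(a is b)               # True: both names refer to the interned copy
print(sys.intern(a) is a)   # True: interning again returns the same object
```

The identity checks are reliable here only because both strings went through an explicit intern call.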

[earlier] I can find documentation of a builtin called 'intern' but its use seems
frowned upon these days.


Not by me, but it's never been useful to *most* apps, apart from the
indirect benefits they get from Python's internal uses of string
interning. It's rare that an app really wants some strings stored
uniquely, and possibly never that an app wants all strings stored
uniquely. Most apps that use explicit string interning appear to be
looking for no more than a partial workalike for Lisp symbols.
Jul 18 '05 #6
Mike Thompson <none.by.e-mail> wrote:
'==' won't help me, I'm afraid.

I need to improve the speed and memory footprint of an application which
reads in a very large XML document.

Yes, I should have read your post carefully. But I was preoccupied with
speed...

From your explanation there seem to be no language rules, just
implementation accidents. And none of those will be particularly
helpful in my case.

With arbitrary strings the likelihood of a cache hit decreases fast. Using
your own dictionary and checking the refcounts could give you interesting
insights. Unfortunately there is no WeakDictionary with both keys and
values as weakrefs, so you have to do some work, or you will actually
_increase_ memory footprint.

However, I still think I'm going to try using the builtin 'intern'
rather than my own dict cache. That may provide an advantage, even if it
doesn't work with unicode.


You might at least choose an alias

my_intern = intern

then, lest you later regret that limitation.
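
A slightly fuller sketch of the alias idea, written so that it runs on both Python 2 and Python 3 (where the builtin moved to `sys.intern`). The fallback import is an addition for later Python versions and was not part of the original suggestion.

```python
try:
    my_intern = intern                      # Python 2: builtin
except NameError:
    from sys import intern as my_intern     # Python 3: moved to sys

# Call sites use only the alias, so a dict-based replacement
# (for example, one that also accepts unicode on Python 2)
# can be swapped in later by rebinding my_intern in one place.
tag = my_intern("".join(["item", "_", "name"]))
same = my_intern("item_name")
print(tag is same)   # True
```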

Peter

Jul 18 '05 #7
"Martin v. Löwis" wrote:
Peter Otten wrote:
String constants that are potential attribute names are also interned:


Peter has explained all this correctly, but this aspect needs some
stressing perhaps: string *literals* that are potential attribute
names are also interned. This interning is done in the compiler,
when the code object is created, so strings not created by the compiler
are not interned.

[all strings are "constant", i.e. immutable, so the statement
above might have been confusing]


Yes, string "literal", not "constant" is the appropriate term for what I
meant.
For completeness here is an example demonstrating that names appearing as
"bare words" in the code are interned:
>>> class X:
...     def __getattr__(self, name):
...         return name
...
>>> a = X().this_is_an_attribute
>>> X().this_is_an_attribute is a
True

Peter
Jul 18 '05 #8
