473,386 Members | 1,821 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Interned Strings

Hello All,

I'm trying to clarify how Python avoids byte by byte
string comparisons most of the time. As I understand,
dictionaries keep strings, their keys (hash values),
and caches of their keys. Caching keys helps to avoid
recalculation of a string's hash value. So, when two
strings need to be compared, only their cached keys
are compared, which improves performance as there is
no need for byte by byte comparison.

Also, there is a global interning dictionary that
keeps interned strings. What I don't understand is why
strings are interned. How does it help with string
comparisons?

Thank you.

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Jan 10 '06 #1
1 1894
Dave wrote:
I'm trying to clarify how Python avoids byte by byte
string comparisons most of the time. As I understand,
dictionaries keep strings, their keys (hash values),
and caches of their keys.
If you are talking about dictionaries with string keys,
you also have the dictionary values, of course.
Caching keys helps to avoid
recalculation of a string's hash value.
Correct (s/keys/string hash values/).
So, when two
strings need to be compared, only their cached keys
are compared, which improves performance as there is
no need for byte by byte comparison.
No. When comparing to strings s1 and s2, the following
operations occur:
1. is s1 and s2 the very same string? If yes, they
are equal.
2. else, do they have the same size, the same first
byte (which might be a null byte), and do they
compare equal, byte-by-byte?
If yes, they are equal, if not, they are not equal.
3. Is it perhaps some other compare operation (<,>

<=, >, >=) that we want to perform? Do the
slow algorithm.

As you can see, the string hash is never consulted when
comparing string objects. It is only consulted to
find the potential dictionary slot in the first place.
Also, there is a global interning dictionary that
keeps interned strings. What I don't understand is why
strings are interned. How does it help with string
comparisons?


Why you look up a dictionary entry, this happens:
1. compute the key hash.
2. find the corresponding dictionary slot
If the slot is empty, KeyError.
3. compare the slot key with the search key.
If they are equal: value found.
If they are different: collision, go to the
next key.

Interned strings speed up step 1 and step 3. If
you only have interned strings throughout, you always
also have the hash value. Of course, you had to
compute the hash value when adding the string
to the interning dictionary.

The real speedup is in step 3, and based on
the assumption that collisions won't happen:
if you lookup the key (e.g. to find the value
of a global variable), you find the slot using
the computed hash. Then:
1. if the slot is empty, it's a KeyError.
2. if the slot is not empty, you first compare
for pointer equality. As collisions are
supposedly unlikely, this will be an equal
string most of the time. Then, if you have
interning, it even will be the *same* string,
so you only need to compare pointers to find
out they are the same string.

So assuming all strings are interned, this is how
you do the dictionary lookup.
1. fetch the hash value from the key (no need to
compute it - it's already cached)
2. go to the slot in the dict.
3. if the slot is empty (==NULL): KeyError
4. otherwise: success.
As you can see, in that case, there is no need to
compare the string values. If there are collisions,
and if not all strings are interned, the algorithm
gets more complicated, but four items above are
assumed to be the common case.

Regards,
Martin
Jan 10 '06 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

20
by: Ravi | last post by:
Hi, I have about 200GB of data that I need to go through and extract the common first part of a line. Something like this. >>>a = "abcdefghijklmnopqrstuvwxyz" >>>b = "abcdefghijklmnopBHLHT"...
17
by: Gordon Airport | last post by:
Has anyone suggested introducing a mutable string type (yes, of course) and distinguishing them from standard strings by the quote type - single or double? As far as I know ' and " are currently...
1
by: Byron Morgan | last post by:
Anyone run into this before? I have a python app that has been reliable, running for days on end without a crash. Suddenly, It repeatedly crashes with the following message: "Fatal Python...
16
by: Paul Prescod | last post by:
I skimmed the tutorial and something alarmed me. "Strings are a powerful data type in Prothon. Unlike many languages, they can be of unlimited size (constrained only by memory size) and can hold...
4
by: agent349 | last post by:
First off, I know arrays can't be compared directly (ie: if (arrary1 == array2)). However, I've been trying to compare two arrays using pointers with no success. Basically, I want to take three...
4
by: Jon Skeet | last post by:
I've just noticed something rather odd and disturbing. The following code displays "True": using System; class Test { public static void Main(string args) { string x = new string...
25
by: Rainmaker | last post by:
Hi, Can anyone tell me an efficient algorithm to sort an array of strings? Keep in mind that this array is HUGE and so the algorithm should me efficient enough to deal with it. Thanks
2
by: Potiuper | last post by:
Question: Is it possible to use a char pointer array ( char *<name> ) to read an array of strings from a file in C? Given: code is written in ANSI C; I know the exact nature of the strings to be...
95
by: hstagni | last post by:
Where can I find a library to created text-based windows applications? Im looking for a library that can make windows and buttons inside console.. Many old apps were make like this, i guess ...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.