Optimisation Hints (dict processing and strings)

OPQ

Hi all,
I'd happy to have you share some thougts about ultimate optimisations
on those 2 topics:

(1)- adding one caractere at the end of a string (may be long)
(2)- in a dict mapping a key to a list of int, remove every entrie
where the list of int have of length < 2
So far, my attempts are
for (1):

longone=long one + char # where len(char)== 1

I known that string concatenation is time consuming, but a small test
on timeit seems to show that packing and creating an array for those 2
elements is equally time consuming
for (2):
for k in hash.keys()[:]: # Note : Their may be a lot of keys here
if len(hash[k])<2:
del hash[k]
Here again, I think the hash.keys duplication can be time *and* memory
consuming. But still better than (I suppose - no time it yet)
hash=dict([(k,v) for (k,v) in hash if len(v)>1])
I know that I can experiment further with timeit, but still would like
to hear from your experience.

Thanks !

--OPQ

Jul 18 '05 #1

Subscribe Reply

1436

Peter Hansen

OPQ wrote:

I'd happy to have you share some thougts about ultimate optimisations
on those 2 topics:

(1)- adding one caractere at the end of a string (may be long)
longone=lon gone + char # where len(char)== 1

I known that string concatenation is time consuming, but a small test
on timeit seems to show that packing and creating an array for those 2
elements is equally time consuming

You've misunderstood the comments about this area.
String concatenation is *not* "time consuming".
*Repeated* concatenations *will become very time
consuming*, along an exponential curve. That's
what the discussions about O(n^2) are referring
to. A single string concatenation is definitely
going to be faster than any amount of converting
to other data structures and such.

Basically, to concatenate two strings, Python
simply adds their sizes together and allocates
a new memory area of sufficient size, then copies
the bytes from each string to the new area. Not
much overhead there, for a single concatenation.

The problem comes when you try to scale this.
At some point, the overhead of allocating and
deallocating all the intermediate strings becomes
much larger than other approaches would take.
For example, by building a list with references
to all the strings, then using join(), you save
on the intermediate steps. Join performs the
length-summation and memory allocation steps
only once, and copies each individual string into
the final area only once. Much faster, provided
the overhead of growing a list as you append()
string references is small enough. And it is,
since growing a list is not O(n^2) in Python
(though I don't recall exactly what it is).

The key with this stuff is to realize that neither
approach is inherently faster: it depends on how
much concatenation you are doing. It could be
that concatenating up to four strings is faster
than using the append/join technique, or it could
be that it's faster up to twenty strings.

When you're not talking hundreds of operations,
and when you're not talking about actual *measured*
performance problems (in other words, when we would
call it "premature optimization"), focus on the
readability of the code and on keeping it simple
and clean. Worry about the correctness of the code,
and leave optimization worries till later. (But
understand this "big O" stuff well enough to
be able to avoid the real problem areas, when
possible.)

-Peter

Jul 18 '05 #2

Daniel Dittmar

OPQ wrote:

for (1):
longone=lon gone + char # where len(char)== 1

I known that string concatenation is time consuming, but a small test
on timeit seems to show that packing and creating an array for those 2
elements is equally time consuming

- use cStringIO instead
- or append all chars to a list and do "".join (listvar)

for (2):
for k in hash.keys()[:]: # Note : Their may be a lot of keys here
if len(hash[k])<2:
del hash[k]
Here again, I think the hash.keys duplication can be time *and* memory
consuming. But still better than (I suppose - no time it yet)
hash=dict([(k,v) for (k,v) in hash if len(v)>1])

- Try if it isn't faster to iterate using items instead of iterating
over keys
- use the dict.iter* methods to prevent building a list in memory. You
shouldn't use these values directly to delete the entry as this could
break the iterator:

for key in [k for (k, v) in hash.iteritems () if len (v) < 2]:
del hash (key)

This of course builds a list of keys to delete, which could also be large.

- also: hash.keys()[:] is not necessary, hash.keys () is already a copy

Daniel

Jul 18 '05 #3

John Machin

OPQ wrote:

(2)- in a dict mapping a key to a list of int, remove every entrie
where the list of int have of length < 2
So far, my attempts are
for (2):
for k in hash.keys()[:]: # Note : Their may be a lot of keys here
if len(hash[k])<2:
del hash[k]
Here again, I think the hash.keys duplication can be time *and* memory consuming. But still better than (I suppose - no time it yet)
hash=dict([(k,v) for (k,v) in hash if len(v)>1])

Firstly, don't call it "hash"; you are shadowing a built-in function --
as well as giving clues to your background :-)

Secondly, you are *triplicating* the keys. Mapping.keys() returns a
list. You can safely iterate over that to change the contents of the
mapping. Read the docs: "a.keys() a copy of a's list of keys".
Mapping.keys()[:] gives you a copy of the copy.

Cheers,
John

Jul 18 '05 #4

OPQ

#For the record, I'm not on premature optimisation anymore.
#The code is working. I just want to save hours of computing, without
relying to much on C extensions.
#Nevertheless, thansk for tips, clarifications and explanations.

>longone=lon gone + char # where len(char)== 1
I known that string concatenation is time consuming, but a small test
on timeit seems to show that packing and creating an array for those 2
elements is equally time consuming

- use cStringIO instead
- or append all chars to a list and do "".join (listvar)

Yes, but as Peter said using a list won't be useful for a single
operation.
And I have to compute the string at each step of the loop. I think
listvar won't be useful.
But I'm going to try cStringIO.

for (2):
for k in hash.keys()[:]: # Note : Their may be a lot of keys here
if len(hash[k])<2:
del hash[k]
- Try if it isn't faster to iterate using items instead of iterating
over keys

items are huge lists of numbers. keys are simple small strings. And
even if it is faster, how can I find the key back, in order to delete
it ?
for v in hashh.items():
if len(v)<2:
del ???????
- use the dict.iter* methods to prevent building a list in memory. You
shouldn't use these values directly to delete the entry as this could
break the iterator:

for key in [k for (k, v) in hash.iteritems () if len (v) < 2]:
del hash (key)

I gonna try, but think that would be overkill: a whole list has to be
computed !
Maybe whith genexps ...... for key in (k for (k,v) in hash.iteritems( )
if len(v)<2)

This of course builds a list of keys to delete, which could also be large.

Yes. Which tell me to use genexps.
- also: hash.keys()[:] is not necessary, hash.keys () is already a copy

Yes I got that one, but too late !

Thanks !

Jul 18 '05 #5

MyHaz

I posted a question about string concatination just last week. There is
plenty of discussion on it, if you just search the group.

Jul 18 '05 #6

Bengt Richter

On 29 Mar 2005 23:41:15 -0800, po***@hotmail.c om (OPQ) wrote:
[...]

> for k in hash.keys()[:]: # Note : Their may be a lot of keys here
> if len(hash[k])<2:
> del hash[k]
- Try if it isn't faster to iterate using items instead of iterating
over keys

items are huge lists of numbers. keys are simple small strings. And
even if it is faster, how can I find the key back, in order to delete
it ?

items() returns a list of (k,v) tuples, so you can unpack them
into the for assignment target, e.g.,

for k,v in hashh.items():
if len(v)<2:
del hassh[k]
for v in hashh.items():
if len(v)<2:
del ???????

but to avoid a big temporary items() list, perhaps (untested)

for k in [k for k,v in hashh.iteritems () if len(v)<2]:
del hassh[k]

UIAM that should build a temporary safe list
only of the keys selected for deletion.

Regards,
Bengt Richter

Jul 18 '05 #7

Daniel Dittmar

OPQ wrote:

- Try if it isn't faster to iterate using items instead of iterating
over keys

items are huge lists of numbers. keys are simple small strings. And
even if it is faster, how can I find the key back, in order to delete
it ?
for v in hashh.items():
if len(v)<2:
del ???????

To elaborate on the memory requirements of .keys () vs. items ():

..keys () creates a new list of n objects. The objects are additional
references to the existing keys.

..items () creates also a new list of n objects. These objects are tuples
of references, one to the key and one to the value. Only references
are used so it doesn't matter how large the value actually is. Whether
the tuples are created for the items () call or already exist depends on
the implementation of the dictionary. Trying to infer this by using
sys.getrefcount got me inconclusive results.
I gonna try, but think that would be overkill: a whole list has to be
computed !
Maybe whith genexps ...... for key in (k for (k,v) in hash.iteritems( )
if len(v)<2)

Using only iterators has problems:

for k,v in hash.iteritems ():
if len (v) < 2:
del hash [k]

You are changing hash while you iterate over it, this very often breaks
the iterator.

If you are memory bound, maybe a dabase like SQLite is really the way to
go. Or you could write the keys to a remporary file in the loop and then
write a second loop that reads the keys and deletes them from hash.

Daniel

Jul 18 '05 #8

Kent Johnson

OPQ wrote:

for (2):
for k in hash.keys()[:]: # Note : Their may be a lot of keys here
if len(hash[k])<2:
del hash[k]

- use the dict.iter* methods to prevent building a list in memory. You
shouldn't use these values directly to delete the entry as this could
break the iterator:

for key in [k for (k, v) in hash.iteritems () if len (v) < 2]:
del hash (key)

I gonna try, but think that would be overkill: a whole list has to be
computed !

Yes, but it is smaller than the list returned by hash.keys(), so it should be a win over what you
were doing originally. Plus it avoids a lookup (hash[k]) which may improve the speed also.

BTW I have long assumed that iterating key, value pairs of a dict using iteritems() is faster than
iterating with keys() followed by a lookup, since the former method should be able to avoid actually
hashing the key and looking it up.

I finally wrote a test, and my assumption seems to be correct; using iteritems() is about 1/3 faster
for simple keys.

Here is a simple test:

##

d = dict((i, i) for i in range(10000))

def withItems(d):
for k,v in d.iteritems():
pass

def withKeys(d):
for k in d:
d[k]

from timeit import Timer

for fn in [withItems, withKeys]:
name = fn.__name__
timer = Timer('%s(d)' % name, 'from __main__ import d, %s' % name)
print name, timer.timeit(10 00)

##

I get
withItems 0.980311184801
withKeys 1.37672944466

Kent

Jul 18 '05 #9

Similar topics

12243

filter()ing a dict

by: Robin Cull | last post by:

Imagine I have a dict looking something like this: myDict = {"key 1": , "key 2": , "key 3": , "key 4": } That is, a set of keys which have a variable length list of associated values after them. What I want to do is filter out a subset of this dict to produce another dict that satisfies a set of criteria (in this case whether it contains all four values) to end up with something

Python

1605

Help and optimization hints, anyone?

by: Kim Petersen | last post by:

I've worked on this table object a bit too long - and seem to have stared too long at the code. Can someone see where it goes wrong in the insertrow function? Also optimization hints and alternatives will be greatly appreciated. <code> #!env python # # created: "15:19 20/01-2004" by Kim Petersen <kim@vindinggaard.dk>

Python

25748

left padding zeroes on a string...

by: cjl | last post by:

Hey all: I want to convert strings (ex. '3', '32') to strings with left padded zeroes (ex. '003', '032'), so I tried this: string1 = '32' string2 = "%03s" % (string1) print string2 >32

Python

1939

Case-insensitive dict, non-destructive, fast, anyone?

by: Ville Vainio | last post by:

I need a dict (well, it would be optimal anyway) class that stores the keys as strings without coercing the case to upper or lower, but still provides fast lookup (i.e. uses hash table). >> d = CiDict() >> d 12

Python

2372

Optimisation techniques

by: EC-AKD | last post by:

Hi All, I am new to the concept of optimising codes in C. I was wondering if C level inlining of code is any way comparable to macros. I believe that inlining is equivalent to writing macros. However I had this strange feeling if I should go for macros wherever necessary instead of inlining the codes. In my project, I have to use basic operations like add and sub many times. So I initially inlined them and found a lot of optimisation.

C / C++

5520

Print dict in sorted order

by: Kamilche | last post by:

I have a code snippet here that prints a dict in an arbitrary order. (Certain keys first, with rest appearing in sorted order). I didn't want to subclass dict, that's error-prone, and overkill for my needs. I just need something that returns a value like dict.__str__, with a key ordering I specify. If you have any opinions on how it could be made better, I'm all ears!

Python

5285

type = "instance" instead of "dict"

by: Cruella DeVille | last post by:

I'm trying to implement a bookmark-url program, which accepts user input and puts the strings in a dictionary. Somehow I'm not able to iterate myDictionary of type Dict{} When I write print type(myDictionary) I get that the type is "instance", which makes no sense to me. What does that mean? Thanks

Python

2649

xml.parsers.expat loading xml into a dict and whitespace

by: kaens | last post by:

Hey everyone, this may be a stupid question, but I noticed the following and as I'm pretty new to using xml and python, I was wondering if I could get an explanation. Let's say I write a simple xml parser, for an xml file that just loads the content of each tag into a dict (the xml file doesn't have multiple hierarchies in it, it's flat other than the parent node) so we have <parent>

Python

1628

dict.get and str.xsplit

by: bearophileHUGS | last post by:

When I use dictionaries I find the dict.get() method very handy, and I'd like to use it often, but I think it's quite slow. This is a small benchmark: from random import randrange from timeit import default_timer as clock def main(N, M): keys = list(set(randrange(2*N) for n in xrange(N))) n2 = len(keys) / 2

Python

9711

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9593

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

10595

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10343

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10335

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

10088

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

9169

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

4306

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

3831

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP