Hi!
I need to enlighten myself in Python unicode speed and implementation.
My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.
First a simple example (and time results):
x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s
x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x
longer than creating a non-unicode string?
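For reference, the shell timings above can be reproduced in-process with timeit. This is a sketch, not from the thread; note that on Python 3 every str is unicode, so the closest analogue of the original comparison is bytes vs. str repetition:

```python
import timeit

n = 50000000

# On Python 2 the comparison was "a"*n (byte string) vs u"a"*n
# (unicode); on Python 3 the nearest equivalent is bytes vs str.
t_bytes = timeit.timeit(lambda: b"a" * n, number=3)
t_text = timeit.timeit(lambda: "a" * n, number=3)

print("bytes: %.3fs  text: %.3fs" % (t_bytes, t_text))
```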
Another situation: speed problem with long strings
I have a simple function for removing diacritics from a string:
#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
Now the calling sequence (and time results):
for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)
real 0m17.021s
user 0m11.139s
sys 0m5.116s
for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)
real 0m0.548s
user 0m0.502s
sys 0m0.004s
In both cases the total amount of data is equal, but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode, but I would
like to know the reason.
Thanks for notes!
David
David Siroky: output = ''
I suspect you really want "output = u''" here.
for c in line: if not unicodedata.combining(c): output += c
This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.
This is about 10 times faster for me:
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
Neil
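A quick sanity check of the list-and-join approach, written as a Python 3 adaptation (the u'' prefix and the unicode() coercion are unnecessary there); the accented input is my own example, not from the thread:

```python
import unicodedata

def no_diacritics(line):
    # NFKD splits each accented character into its base character plus
    # combining marks; filtering out the combining marks strips accents.
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return ''.join(output)

print(no_diacritics(u'na\u00efve caf\u00e9'))  # -> naive cafe
```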
In article <pa*************************@email.cz>,
David Siroky <ds*****@email.cz> wrote:
Hi!
I need to enlighten myself in Python unicode speed and implementation.
My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.
First a simple example (and time results):
x = "a"*50000000  real 0m0.195s user 0m0.144s sys 0m0.046s
x = u"a"*50000000  real 0m2.477s user 0m2.119s sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x longer than creating a non-unicode string?
Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled oddly). Check the size of Unicode
chars with:
import sys
hex(sys.maxunicode)
If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.
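The per-character cost can also be inspected directly with sys.getsizeof. A sketch, with the caveat that Python 3.3+ uses a flexible string representation (width depends on the widest character present), unlike the fixed 2- or 4-byte builds discussed here:

```python
import sys

ascii_text = "a" * 1000       # stored 1 byte/char on Python 3.3+
wide_text = "\u1234" * 1000   # needs 2 bytes/char
raw_bytes = b"a" * 1000       # 1 byte/char, smaller header

print(sys.getsizeof(raw_bytes),
      sys.getsizeof(ascii_text),
      sys.getsizeof(wide_text))
```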
Another situation: speed problem with long strings
I have a simple function for removing diacritics from a string:
#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
import unicodedata
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
Now the calling sequence (and time results):
for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)
real 0m17.021s user 0m11.139s sys 0m5.116s
for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)
real 0m0.548s user 0m0.502s sys 0m0.004s
In both cases the total amount of data is equal, but when I use shorter strings it is much faster. Maybe it has nothing to do with Python unicode, but I would like to know the reason.
It has to do with how strings (either kind) are implemented. Strings
are "immutable", so string concatenation is done by making a new string
that holds the concatenated value and assigning it to the left-hand side.
Often it is faster (but more memory intensive) to append to a list and
then at the end do a u''.join(mylist). See GvR's essay on optimization
at <http://www.python.org/doc/essays/list2str.html>.
Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
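A rough sketch of that idea: an appendable buffer that only becomes a string once, at the end. bytearray (a later addition to Python) is used here as the modern stand-in for a mutable byte string; for arbitrary unicode, a list plus ''.join remains the usual idiom:

```python
# Accumulate characters in a mutable buffer instead of creating a new
# immutable string object on every append.
buf = bytearray()
for ch in "abcdef":
    buf.append(ord(ch))      # in-place append, amortized O(1)

result = buf.decode("ascii")  # one final conversion to a string
print(result)                 # -> abcdef
```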
TonyN. <http://www.georgeanelson.com/>
On Tue, Nov 29, 2005 at 09:48:15AM +0100, David Siroky wrote:
Hi! I need to enlighten myself in Python unicode speed and implementation. My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4. First a simple example (and time results):
x = "a"*50000000  real 0m0.195s user 0m0.144s sys 0m0.046s
x = u"a"*50000000  real 0m2.477s user 0m2.119s sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x longer than creating a non-unicode string?
string objects have the optimization described in the log message below.
The same optimization hasn't been made to unicode_repeat, though it would
probably also benefit from it.
------------------------------------------------------------------------
r30616 | rhettinger | 2003-01-06 04:33:56 -0600 (Mon, 06 Jan 2003) | 11 lines
Optimize string_repeat.
Christian Tismer pointed out the high cost of the loop overhead and
function call overhead for 'c' * n where n is large. Accordingly,
the new code only makes lg2(n) loops.
Interestingly, 'c' * 1000 * 1000 ran a bit faster with old code. At some
point, the loop and function call overhead became cheaper than invalidating
the cache with lengthy memcpys. But for more typical sizes of n, the new
code runs much faster and for larger values of n it runs only a bit slower.
------------------------------------------------------------------------
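The lg2(n) trick from that commit can be sketched in pure Python (illustrative only; the real optimization lives in C inside string_repeat):

```python
def repeat(s, n):
    """Repeat s n times using O(log n) concatenations (doubling)."""
    target = len(s) * n
    if target <= 0:
        return s[:0]
    out = s
    while len(out) * 2 <= target:
        out += out                  # double the buffer each pass
    out += out[:target - len(out)]  # one final partial copy
    return out

print(repeat("ab", 5))  # -> ababababab
```

Each pass copies the whole buffer so far, which is how a length-n result is built in about lg2(n) memcpy-sized steps instead of n one-character appends.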
If you're a "C" coder too, consider creating and submitting a patch to do this
to the patch tracker on http://sf.net/projects/python. That's the best thing
you can do to ensure the optimization is considered for a future release of
Python.
Jeff
On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
David Siroky:
output = ''
I suspect you really want "output = u''" here.
for c in line: if not unicodedata.combining(c): output += c
This is creating as many as 50000 new string objects of increasing size. To build large strings, some common faster techniques are to either create a list of characters and then use join on the list or use a cStringIO to accumulate the characters.
That is the answer I wanted, now I'm finally enlightened! :-)
This is about 10 times faster for me:
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
Neil
Thanx!
David