
unicode speed

Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*5000000 0
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*5000000 0
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more than
10x longer than creating a non-unicode string?

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is the same, but with the shorter strings
it runs much faster. Maybe it has nothing to do with Python unicode, but I would
like to know the reason.

Thanks for notes!

David

Nov 29 '05 #1
David Siroky:
output = ''

I suspect you really want "output = u''" here.

for c in line:
    if not unicodedata.combining(c):
        output += c


This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.

This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
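
If you want to measure the two approaches yourself, something along these lines
should do it (a rough timeit sketch; concat_version and join_version are just
stand-ins for the += version and the list/join version above):

# Rough timing sketch: compares += concatenation against list + join.
import timeit

setup = """
import unicodedata

def concat_version(line):
    line = unicodedata.normalize('NFKD', line)
    output = u''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

def join_version(line):
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)

data = u'a' * 50000
"""

for name in ('concat_version', 'join_version'):
    timer = timeit.Timer('%s(data)' % name, setup)
    print name, min(timer.repeat(3, 1))  # best of 3 single runs, in seconds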

Neil
Nov 29 '05 #2
In article <pa*****************************@email.cz>,
David Siroky <ds*****@email.cz> wrote:
Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*5000000 0
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*5000000 0
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more than
10x longer than creating a non-unicode string?

Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled oddly). Check the size of Unicode
chars by:

import sys
hex(sys.maxunicode)

If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.
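
As a quick back-of-the-envelope check for the wide (4-byte) build:
50,000,000 unichars * 4 bytes = 200,000,000 bytes, i.e. roughly 200 MB,
versus about 50 MB for the plain byte string.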

Another situation: speed problem with long strings

I have a simple function for removing diacritics from a string:

#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-

import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output

Now the calling sequence (and time results):

for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)

real 0m17.021s
user 0m11.139s
sys 0m5.116s

for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)

real 0m0.548s
user 0m0.502s
sys 0m0.004s

In both cases the total amount of data is the same, but with the shorter strings
it runs much faster. Maybe it has nothing to do with Python unicode, but I would
like to know the reason.


It has to do with how strings (either kind) are implemented. Strings
are "immutable", so string concatenation is done by making a new string
that holds the concatenated value, and assigning it to the left-hand side.
Often, it is faster (but more memory intensive) to append to a list and
then at the end do a u''.join(mylist). See GvR's essay on optimization
at <http://www.python.org/doc/essays/list2str.html>.

Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
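
For example, a minimal sketch of that approach might look like this (an
untested illustration only, using the 'u' typecode for unicode characters;
no_diacritics_array is just a name for this variant):

# Sketch: accumulate characters in a mutable array.array instead of
# building ever-larger immutable strings (illustration, not tested).
import array
import unicodedata

def no_diacritics_array(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    buf = array.array('u')             # mutable buffer of unicode chars
    for c in line:
        if not unicodedata.combining(c):
            buf.append(c)
    return buf.tounicode()             # convert the buffer back to unicode
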
_________________________________________________________________________
TonyN.:'    *firstname*nlsnews@georgea*lastname*.com
       '    <http://www.georgeanelson.com/>
Nov 29 '05 #3
On Tue, Nov 29, 2005 at 09:48:15AM +0100, David Siroky wrote:
Hi!

I need to enlighten myself in Python unicode speed and implementation.

My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.

First a simple example (and time results):

x = "a"*5000000 0
real 0m0.195s
user 0m0.144s
sys 0m0.046s

x = u"a"*5000000 0
real 0m2.477s
user 0m2.119s
sys 0m0.225s

So my first question is: why does creating a unicode string take more than
10x longer than creating a non-unicode string?


string objects have the optimization described in the log message below.
The same optimization hasn't been made to unicode_repeat, though it would
probably also benefit from it.

------------------------------------------------------------------------
r30616 | rhettinger | 2003-01-06 04:33:56 -0600 (Mon, 06 Jan 2003) | 11 lines

Optimize string_repeat.

Christian Tismer pointed out the high cost of the loop overhead and
function call overhead for 'c' * n where n is large. Accordingly,
the new code only makes lg2(n) loops.

Interestingly, 'c' * 1000 * 1000 ran a bit faster with old code. At some
point, the loop and function call overhead became cheaper than invalidating
the cache with lengthy memcpys. But for more typical sizes of n, the new
code runs much faster and for larger values of n it runs only a bit slower.
------------------------------------------------------------------------

If you're a "C" coder too, consider creating and submitting a patch to do this
to the patch tracker on http://sf.net/projects/python . That's the best thing
you can do to ensure the optimization is considered for a future release of
Python.
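
In pure Python the idea looks roughly like this (just a sketch of the doubling
trick; the real optimization is written in C inside string_repeat, and
repeat_by_doubling is only an illustrative name):

# Sketch of the lg2(n) doubling idea behind the string_repeat optimization
# (illustration only; the real code is C and works on raw buffers).
def repeat_by_doubling(s, n):
    if n <= 0 or not s:
        return s[:0]
    target = n * len(s)
    result = s
    while len(result) * 2 <= target:
        result += result                      # double the buffer: ~lg2(n) steps
    result += result[:target - len(result)]   # top up with the remainder
    return result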

Jeff


Nov 29 '05 #4
On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
David Siroky:
output = ''

I suspect you really want "output = u''" here.

for c in line:
    if not unicodedata.combining(c):
        output += c


This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.


That is the answer I wanted; now I'm finally enlightened! :-)

This is about 10 times faster for me:

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')

    line = unicodedata.normalize('NFKD', line)

    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)

Neil

Thanx!

David
Nov 30 '05 #5

