Hi!
I need to enlighten myself in Python unicode speed and implementation.
My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.
First a simple example (and time results):
x = "a"*50000000
real 0m0.195s
user 0m0.144s
sys 0m0.046s
x = u"a"*50000000
real 0m2.477s
user 0m2.119s
sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x
longer than creating a non-unicode string?
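For reference, the shell timings above can be reproduced in-process with timeit. This is a sketch, not from the thread; note that on Python 3 every str is unicode, so the closest analogue of the original comparison is bytes vs. str repetition:

```python
import timeit

n = 50000000

# On Python 2 the comparison was "a"*n (byte string) vs u"a"*n
# (unicode); on Python 3 the nearest equivalent is bytes vs str.
t_bytes = timeit.timeit(lambda: b"a" * n, number=3)
t_text = timeit.timeit(lambda: "a" * n, number=3)

print("bytes: %.3fs  text: %.3fs" % (t_bytes, t_text))
```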
Another situation: speed problem with long strings
I have a simple function for removing diacritics from a string:
#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
import unicodedata

def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
Now the calling sequence (and time results):
for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)
real 0m17.021s
user 0m11.139s
sys 0m5.116s
for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)
real 0m0.548s
user 0m0.502s
sys 0m0.004s
In both cases the total amount of data is equal, but when I use shorter strings
it is much faster. Maybe it has nothing to do with Python unicode, but I would
like to know the reason.
Thanks for notes!
David
David Siroky: output = ''
I suspect you really want "output = u''" here.
for c in line: if not unicodedata.combining(c): output += c
This is creating as many as 50000 new string objects of increasing
size. To build large strings, some common faster techniques are to
either create a list of characters and then use join on the list or use
a cStringIO to accumulate the characters.
This is about 10 times faster for me:
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
Neil
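A quick sanity check of the list-and-join approach, written as a Python 3 adaptation (the u'' prefix and the unicode() coercion are unnecessary there); the accented input is my own example, not from the thread:

```python
import unicodedata

def no_diacritics(line):
    # NFKD splits each accented character into its base character plus
    # combining marks; filtering out the combining marks strips accents.
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return ''.join(output)

print(no_diacritics(u'na\u00efve caf\u00e9'))  # -> naive cafe
```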
In article <pa*************************@email.cz>,
David Siroky <ds*****@email.cz> wrote:
Hi!
I need to enlighten myself in Python unicode speed and implementation.
My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4.
First a simple example (and time results):
x = "a"*50000000  real 0m0.195s user 0m0.144s sys 0m0.046s
x = u"a"*50000000  real 0m2.477s user 0m2.119s sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x longer than creating a non-unicode string?
Your first example uses about 50 MB. Your second uses about 200 MB (or
100 MB if your Python is compiled oddly). Check the size of Unicode
chars with:
import sys
hex(sys.maxunicode)
If it says '0x10ffff', each unichar uses 4 bytes; if it says '0xffff',
each unichar uses 2 bytes.
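The per-character cost can also be inspected directly with sys.getsizeof. A sketch, with the caveat that Python 3.3+ uses a flexible string representation (width depends on the widest character present), unlike the fixed 2- or 4-byte builds discussed here:

```python
import sys

ascii_text = "a" * 1000       # stored 1 byte/char on Python 3.3+
wide_text = "\u1234" * 1000   # needs 2 bytes/char
raw_bytes = b"a" * 1000       # 1 byte/char, smaller header

print(sys.getsizeof(raw_bytes),
      sys.getsizeof(ascii_text),
      sys.getsizeof(wide_text))
```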
Another situation: speed problem with long strings
I have a simple function for removing diacritics from a string:
#!/usr/bin/python2.4
# -*- coding: UTF-8 -*-
import unicodedata
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = ''
    for c in line:
        if not unicodedata.combining(c):
            output += c
    return output
Now the calling sequence (and time results):
for i in xrange(1):
    x = u"a"*50000
    y = no_diacritics(x)
real 0m17.021s user 0m11.139s sys 0m5.116s
for i in xrange(5):
    x = u"a"*10000
    y = no_diacritics(x)
real 0m0.548s user 0m0.502s sys 0m0.004s
In both cases the total amount of data is equal, but when I use shorter strings it is much faster. Maybe it has nothing to do with Python unicode, but I would like to know the reason.
It has to do with how strings (either kind) are implemented. Strings
are "immutable", so string concatenation is done by making a new string
that holds the concatenated value and assigning it to the left-hand side.
Often it is faster (but more memory intensive) to append to a list and
then at the end do a u''.join(mylist). See GvR's essay on optimization
at <http://www.python.org/doc/essays/list2str.html>.
Alternatively, you could use array.array from the Python Library (it's
easy) to get something "just as good as" mutable strings.
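A rough sketch of that idea: an appendable buffer that only becomes a string once, at the end. bytearray (a later addition to Python) is used here as the modern stand-in for a mutable byte string; for arbitrary unicode, a list plus ''.join remains the usual idiom:

```python
# Accumulate characters in a mutable buffer instead of creating a new
# immutable string object on every append.
buf = bytearray()
for ch in "abcdef":
    buf.append(ord(ch))      # in-place append, amortized O(1)

result = buf.decode("ascii")  # one final conversion to a string
print(result)                 # -> abcdef
```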
TonyN. <http://www.georgeanelson.com/>
On Tue, Nov 29, 2005 at 09:48:15AM +0100, David Siroky wrote:
Hi! I need to enlighten myself in Python unicode speed and implementation. My platform is AMD Athlon@1300 (x86-32), Debian, Python 2.4. First a simple example (and time results):
x = "a"*50000000  real 0m0.195s user 0m0.144s sys 0m0.046s
x = u"a"*50000000  real 0m2.477s user 0m2.119s sys 0m0.225s
So my first question is: why does creating a unicode string take more than 10x longer than creating a non-unicode string?
string objects have the optimization described in the log message below.
The same optimization hasn't been made to unicode_repeat, though it would
probably also benefit from it.
------------------------------------------------------------------------
r30616 | rhettinger | 2003-01-06 04:33:56 -0600 (Mon, 06 Jan 2003) | 11 lines
Optimize string_repeat.
Christian Tismer pointed out the high cost of the loop overhead and
function call overhead for 'c' * n where n is large. Accordingly,
the new code only makes lg2(n) loops.
Interestingly, 'c' * 1000 * 1000 ran a bit faster with old code. At some
point, the loop and function call overhead became cheaper than invalidating
the cache with lengthy memcpys. But for more typical sizes of n, the new
code runs much faster and for larger values of n it runs only a bit slower.
------------------------------------------------------------------------
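The lg2(n) trick from that commit can be sketched in pure Python (illustrative only; the real optimization lives in C inside string_repeat):

```python
def repeat(s, n):
    """Repeat s n times using O(log n) concatenations (doubling)."""
    target = len(s) * n
    if target <= 0:
        return s[:0]
    out = s
    while len(out) * 2 <= target:
        out += out                  # double the buffer each pass
    out += out[:target - len(out)]  # one final partial copy
    return out

print(repeat("ab", 5))  # -> ababababab
```

Each pass copies the whole buffer so far, which is how a length-n result is built in about lg2(n) memcpy-sized steps instead of n one-character appends.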
If you're a "C" coder too, consider creating and submitting a patch to do this
to the patch tracker on http://sf.net/projects/python. That's the best thing
you can do to ensure the optimization is considered for a future release of
Python.
Jeff
On Tue, 29 Nov 2005 10:14:26 +0000, Neil Hodgson wrote:
David Siroky:
output = ''
I suspect you really want "output = u''" here.
for c in line: if not unicodedata.combining(c): output += c
This is creating as many as 50000 new string objects of increasing size. To build large strings, some common faster techniques are to either create a list of characters and then use join on the list or use a cStringIO to accumulate the characters.
That is the answer I wanted, now I'm finally enlightened! :-)
This is about 10 times faster for me:
def no_diacritics(line):
    if type(line) != unicode:
        line = unicode(line, 'utf-8')
    line = unicodedata.normalize('NFKD', line)
    output = []
    for c in line:
        if not unicodedata.combining(c):
            output.append(c)
    return u''.join(output)
Neil
Thanx!
David