Unicode charmap decoders slow

Is there a faster way to decode from charmaps to utf-8 than unicode()?

I'm writing a small card-file program. As a test, I use a 53 MB MBox
file, in mac-roman encoding. My program reads and parses the file into
messages in about 3.5 seconds, but takes about 13.5 seconds to iterate
over the cards and convert them to utf-8:

for i in xrange(len(cards)):
    u = unicode(cards[i], encoding)   # decode from the source charmap
    cards[i] = u.encode('utf-8')      # re-encode as UTF-8

The time is nearly all in the unicode() call. It's not so much how much
time it takes, but that it takes 4 times as long as the real work, just
to do table lookups.
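
To confirm where the time goes, the two steps can be timed separately. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2, where the decode step is spelled unicode(cards[i], encoding)). The sample data and the helper name time_transcode are made up for illustration and are not part of the original program.

```python
import time

def time_transcode(cards, encoding):
    """Transcode each card to UTF-8 in place; return (decode_s, encode_s)."""
    decode_s = encode_s = 0.0
    for i in range(len(cards)):
        t0 = time.perf_counter()
        u = cards[i].decode(encoding)      # Python 3 spelling of unicode(cards[i], encoding)
        t1 = time.perf_counter()
        cards[i] = u.encode("utf-8")
        t2 = time.perf_counter()
        decode_s += t1 - t0
        encode_s += t2 - t1
    return decode_s, encode_s

# Made-up sample data: bytes 0x8E and 0x8F are "é" and "è" in mac-roman.
cards = [b"caf\x8e cr\x8fme"] * 1000
decode_s, encode_s = time_transcode(cards, "mac_roman")
print("decode: %.4fs  encode: %.4fs" % (decode_s, encode_s))
```

The absolute numbers on a current interpreter will not match the 2005 figures quoted here; only the split between the two steps is of interest.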

Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character. I would have thought that it
would make and cache a LUT the size of the charmap (and hook the
relevant dictionary stuff to delete the cached LUT if the dictionary is
changed).
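
The difference between a dictionary mapping and a flat lookup table can be illustrated with codecs.charmap_decode, which in current CPython accepts either a dict or a 256-character string as the mapping. This is a sketch under that assumption, written in Python 3 syntax. It does not describe the 2005 implementation, and dict_map and lut are illustrative names.

```python
import codecs

# Build both forms of the mac-roman mapping from the codec itself,
# decoding each of the 256 byte values on its own.
chars = [bytes([i]).decode("mac_roman", "replace") for i in range(256)]

dict_map = {i: ord(ch) for i, ch in enumerate(chars)}   # byte value -> code point
lut = "".join(chars)                                    # index = byte value

data = b"na\x95ve caf\x8e"   # 0x95 is "ï" and 0x8E is "é" in mac-roman

via_dict, consumed = codecs.charmap_decode(data, "strict", dict_map)
via_lut, _ = codecs.charmap_decode(data, "strict", lut)
print(via_dict == via_lut)
```

Both calls produce the same text. The string form lets the decoder index directly by byte value instead of hashing a key for every character.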

I thought of using U"".translate(), but the unicode version is defined
to be slow. Is there some similar approach? I'm almost (but not quite)
ready to try it in Pyrex.
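
One similar approach, sketched here in Python 3 syntax (where str plays the role of Python 2's unicode): decode the bytes as latin-1, which maps every byte to the code point of the same value, then remap with str.translate(). Whether this beats the built-in codec is not established in this thread and would need measuring. MAC_ROMAN_TABLE and decode_via_translate are hypothetical names.

```python
# Map code points 0..255 (as produced by a latin-1 decode) to the characters
# that mac-roman assigns to the same byte values.
MAC_ROMAN_TABLE = {
    i: bytes([i]).decode("mac_roman", "replace") for i in range(256)
}

def decode_via_translate(data):
    """Decode mac-roman bytes using a latin-1 decode plus a translate() table."""
    return data.decode("latin-1").translate(MAC_ROMAN_TABLE)

sample = bytes(range(256))
print(decode_via_translate(sample) == sample.decode("mac_roman", "replace"))
```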

I'm new to Python. I didn't google anything relevant on python.org or
in groups.
__________________________________________________ ______________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #1
Tony Nelson wrote:
> Is there a faster way to decode from charmaps to utf-8 than unicode()?


You could try the iconv codec, if your system supports iconv:

http://cvs.sourceforge.net/viewcvs.p...ecodecs/iconv/

Regards,
Martin
Oct 3 '05 #2
In article <43**********************@news.freenet.de>,
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
> Tony Nelson wrote:
>> Is there a faster way to decode from charmaps to utf-8 than unicode()?
>
> You could try the iconv codec, if your system supports iconv:
>
> http://cvs.sourceforge.net/viewcvs.p...ecodecs/iconv/


I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().
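
The "almost" is worth spelling out. A byte-string translate() maps each byte to exactly one byte through a 256-entry table, so it can convert between two single-byte charmaps but cannot emit the multi-byte sequences that UTF-8 needs. A minimal sketch in Python 3 syntax, converting mac-roman to latin-1 with "?" for characters latin-1 lacks; the target encoding and the names build_table and MAC_TO_LATIN1 are illustrative choices, not from the thread.

```python
def build_table(src, dst, fallback=b"?"):
    """Build a 256-byte table for bytes.translate(), src charmap -> dst charmap."""
    out = bytearray()
    for i in range(256):
        ch = bytes([i]).decode(src, "replace")
        try:
            b = ch.encode(dst)
        except UnicodeEncodeError:
            b = fallback
        # A translate table holds one byte per entry, so multi-byte results
        # (as UTF-8 would produce) cannot be represented and fall back to "?".
        out += b if len(b) == 1 else fallback
    return bytes(out)

MAC_TO_LATIN1 = build_table("mac_roman", "latin-1")
print(b"caf\x8e".translate(MAC_TO_LATIN1))
```

Asking the same helper for a UTF-8 target shows the limitation: every non-ASCII entry collapses to the fallback byte.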
__________________________________________________ ______________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #3
Tony Nelson wrote:
> I had seen iconv. Even if my system supports it and it is faster than
> Python's charmap decoder, it might not be available on other systems.
> Requiring something unusual in order to do a trivial LUT task isn't an
> acceptable solution. If I write a charmap decoder as an extension
> module in Pyrex I can include it with the program. I would prefer a
> solution that doesn't even need that, preferably in pure Python. Since
> Python does all the hard work so fast it certainly could do it, and it
> can almost do it with "".translate().


Well, did you try a pure-Python version yourself?

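# Precompute the Unicode character for each of the 256 byte values.
table = [chr(i).decode("mac-roman", "replace") for i in range(256)]

def decode_mac_roman(s):
    # One list index per input byte, then a single join.
    result = [table[ord(c)] for c in s]
    return u"".join(result)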

How much faster than the standard codec is that?
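
One way to answer that question is a small timeit harness. The sketch below is a Python 3 port of the decoder above (iterating over bytes yields ints, so the ord() call is dropped); the test input is made up, and timings on a current interpreter will differ from anything measured in 2005.

```python
import timeit

# Table-driven decoder, ported to Python 3.
table = [bytes([i]).decode("mac_roman", "replace") for i in range(256)]

def decode_mac_roman(s):
    return "".join([table[c] for c in s])

data = bytes(range(32, 256)) * 1000     # made-up test input

# Check that both decoders agree before comparing their speed.
assert decode_mac_roman(data) == data.decode("mac_roman", "replace")

t_pure = timeit.timeit(lambda: decode_mac_roman(data), number=5)
t_codec = timeit.timeit(lambda: data.decode("mac_roman", "replace"), number=5)
print("pure Python: %.4fs  built-in codec: %.4fs" % (t_pure, t_codec))
```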

Regards,
Martin
Oct 3 '05 #4
In article <43***********************@news.freenet.de>,
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
> Tony Nelson wrote:
>> I had seen iconv. Even if my system supports it and it is faster than
>> Python's charmap decoder, it might not be available on other systems.
>> Requiring something unusual in order to do a trivial LUT task isn't an
>> acceptable solution. If I write a charmap decoder as an extension
>> module in Pyrex I can include it with the program. I would prefer a
>> solution that doesn't even need that, preferably in pure Python. Since
>> Python does all the hard work so fast it certainly could do it, and it
>> can almost do it with "".translate().
>
> Well, did you try a pure-Python version yourself?
>
> table = [chr(i).decode("mac-roman","replace") for i in range(256)]
>
> def decode_mac_roman(s):
>     result = [table[ord(c)] for c in s]
>     return u"".join(result)
>
> How much faster than the standard codec is that?


It's .18x faster.
__________________________________________________ ______________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #5
