Unicode charmap decoders slow

Is there a faster way to decode from charmaps to utf-8 than unicode()?

I'm writing a small card-file program. As a test, I use a 53 MB MBox
file, in mac-roman encoding. My program reads and parses the file into
messages in about 3.5 seconds, but takes about 13.5 seconds to iterate
over the cards and convert them to utf-8:

for i in xrange(len(cards)):
    u = unicode(cards[i], encoding)
    cards[i] = u.encode('utf-8')

The time is nearly all in the unicode() call. It's not so much how much
time it takes, but that it takes 4 times as long as the real work, just
to do table lookups.
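
For readers on a current interpreter, the loop above can be restated and timed in two halves. This is only a sketch and assumes Python 3, where bytes.decode() takes the place of unicode(); the function name recode_to_utf8 and the sample data are hypothetical, not taken from the original program.

```python
import time

def recode_to_utf8(cards, encoding):
    """Re-encode each byte string in place; time decode and encode separately."""
    decode_time = encode_time = 0.0
    for i, raw in enumerate(cards):
        t0 = time.perf_counter()
        text = raw.decode(encoding)        # Python 3 spelling of unicode(raw, encoding)
        t1 = time.perf_counter()
        cards[i] = text.encode("utf-8")
        t2 = time.perf_counter()
        decode_time += t1 - t0
        encode_time += t2 - t1
    return decode_time, encode_time

# Hypothetical sample data; byte 0x8E is e-acute in Mac Roman.
cards = [b"Caf\x8e", b"plain ASCII"]
decode_s, encode_s = recode_to_utf8(cards, "mac_roman")
print("decode: %.6fs  encode: %.6fs" % (decode_s, encode_s))
```

Splitting the timing this way shows whether the charmap decode or the utf-8 encode dominates on a given interpreter.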

Looking at the source (which, if I have it right, is
PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a
dictionary lookup for each character. I would have thought that it
would make and cache a LUT the size of the charmap (and hook the
relevant dictionary stuff to delete the cached LUT if the dictionary is
changed).
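
The cached-LUT idea can be sketched in pure Python. This assumes Python 3 and a single-byte charmap encoding; charmap_table and decode_with_table are made-up names, and the sketch does not attempt the invalidation hook mentioned above (the cache is simply never cleared).

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def charmap_table(encoding):
    """Build, once per encoding, a 256-entry table: byte value -> character."""
    # Assumes every byte decodes to exactly one character (single-byte charmap).
    return tuple(bytes([b]).decode(encoding, "replace") for b in range(256))

def decode_with_table(data, encoding):
    """Decode bytes by indexing the cached table instead of calling the codec."""
    table = charmap_table(encoding)        # built on first call, cached afterwards
    # Iterating over bytes yields ints in Python 3, so no ord() is needed.
    return "".join([table[b] for b in data])
```

The table costs 256 codec calls the first time an encoding is used; every later decode is a list index per byte plus one join.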

I thought of using U"".translate(), but the unicode version is defined
to be slow. Is there some similar approach? I'm almost (but not quite)
ready to try it in Pyrex.
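
One translate()-style approach, sketched under the assumption of Python 3 rather than the Python 2 of this thread: decode the bytes as latin-1, which maps each byte to the code point of the same value, then let str.translate() index a 256-character table by that ordinal. The names make_translate_table and decode_via_translate are illustrative only, and whether this beats the built-in codec has to be measured.

```python
def make_translate_table(encoding):
    """Return a 256-character string: index = byte value, entry = decoded char."""
    # Assumes a single-byte encoding, so 256 bytes decode to 256 characters.
    return bytes(range(256)).decode(encoding, "replace")

MAC_ROMAN_TABLE = make_translate_table("mac_roman")

def decode_via_translate(data, table=MAC_ROMAN_TABLE):
    """Decode charmap-encoded bytes with a latin-1 decode plus str.translate()."""
    # str.translate() looks up table[ord(ch)] for each character; a str
    # works as the table because it supports integer indexing.
    return data.decode("latin-1").translate(table)
```

The table is an ordinary string, so it can be built once at import time and shared.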

I'm new to Python. I didn't google anything relevant on python.org or
in groups.
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #1
Tony Nelson wrote:
Is there a faster way to decode from charmaps to utf-8 than unicode()?


You could try the iconv codec, if your system supports iconv:

http://cvs.sourceforge.net/viewcvs.p...ecodecs/iconv/

Regards,
Martin
Oct 3 '05 #2
In article <43**********************@news.freenet.de>,
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
Tony Nelson wrote:
Is there a faster way to decode from charmaps to utf-8 than unicode()?


You could try the iconv codec, if your system supports iconv:

http://cvs.sourceforge.net/viewcvs.p...ecodecs/iconv/


I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #3
Tony Nelson wrote:
I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().


Well, did you try a pure-Python version yourself?

table = [chr(i).decode("mac-roman", "replace") for i in range(256)]

def decode_mac_roman(s):
    result = [table[ord(c)] for c in s]
    return u"".join(result)

How much faster than the standard codec is that?
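
To answer that question on a current interpreter, a small timeit harness along these lines could be used. It assumes Python 3, ports the table decoder accordingly (iterating over bytes yields ints, so ord() is dropped), and runs on synthetic data rather than the 53 MB mbox, so the numbers it prints say nothing about the Python 2.4-era result.

```python
import timeit

# Python 3 port of the table-based decoder: one entry per byte value.
table = [bytes([i]).decode("mac_roman", "replace") for i in range(256)]

def decode_mac_roman(s):
    """Decode Mac Roman bytes by per-byte table lookup."""
    return "".join([table[b] for b in s])

# Synthetic test input (about 256 KB); not the original poster's data.
sample = bytes(range(256)) * 1000

# Sanity check: both paths must produce the same text before timing them.
assert decode_mac_roman(sample) == sample.decode("mac_roman", "replace")

t_table = timeit.timeit(lambda: decode_mac_roman(sample), number=5)
t_codec = timeit.timeit(lambda: sample.decode("mac_roman", "replace"), number=5)
print("table lookup: %.4fs   built-in codec: %.4fs" % (t_table, t_codec))
```

Timing both functions on the same input in the same process keeps the comparison fair; the ratio will differ between interpreter versions.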

Regards,
Martin
Oct 3 '05 #4
In article <43***********************@news.freenet.de>,
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
Tony Nelson wrote:
I had seen iconv. Even if my system supports it and it is faster than
Python's charmap decoder, it might not be available on other systems.
Requiring something unusual in order to do a trivial LUT task isn't an
acceptable solution. If I write a charmap decoder as an extension
module in Pyrex I can include it with the program. I would prefer a
solution that doesn't even need that, preferably in pure Python. Since
Python does all the hard work so fast it certainly could do it, and it
can almost do it with "".translate().


Well, did you try a pure-Python version yourself?

table = [chr(i).decode("mac-roman", "replace") for i in range(256)]

def decode_mac_roman(s):
    result = [table[ord(c)] for c in s]
    return u"".join(result)

How much faster than the standard codec is that?


It's .18x faster.
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
Oct 3 '05 #5
