japanese encoding iso-2022-jp in python vs. perl

kettle

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use Perl's Encode module to re-encode the exact same bit
of text:
--
use Encode;
$var = encode("iso-2022-jp", decode("utf8", $var));
print $var;
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters? I know there are a host of different iso-2022-jp
variants, could it be using a different one than I think (the
default)? I'm quite liking python at the moment for a variety of
different reasons (I suspect perl will forever win when it comes to
regular expressions but everything else is pretty darn nice), but this
is a bit worrying.

-Joe

Oct 23 '07 #1


Ryan Ginstrom

kettle wrote:

I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

Possibly silly question: Is that a utf-8 string, or Unicode?

print unicode(var, "utf8").encode("iso-2022-jp")

On my computer (Japanese XP), your string round-trips between utf-8 and
iso-2022-jp without problems.
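
For readers on Python 3, where bytes and text are separate types, the one-liner above could be sketched roughly as follows. The helper name `utf8_to_jis` and the sample string are illustrative assumptions, not something from this thread.

```python
def utf8_to_jis(raw: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-2022-JP.

    Strict mode: raises UnicodeEncodeError on characters that the
    codec cannot map, instead of silently replacing them.
    """
    text = raw.decode("utf-8")         # bytes -> str (Unicode text)
    return text.encode("iso-2022-jp")  # str -> ISO-2022-JP bytes


# Hypothetical sample that contains only full-width characters.
sample = "東京メトロ".encode("utf-8")
jis = utf8_to_jis(sample)
print(jis)  # a 7-bit byte string with escape sequences
```

If the input is already a `str`, only the second step applies; calling `decode` on it is the Python 3 equivalent of mixing up "a utf-8 string" and "Unicode".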

Another possible thing to look at is whether your Python output terminal can
print Japanese OK. Does it choke when printing the string as Unicode?

Regards,
Ryan Ginstrom

Oct 23 '07 #2

Martin v. Löwis

var = var.encode("iso-2022-jp", "replace")

print var

[...]

↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters?

It's not clear. As Ryan says, it works just fine (and so it does for
me with Python 2.4.4 on Debian).

What Python version are you using, and what is the precise string that
you want to encode? (use "print repr(var)" to report that exact value)
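
A small sketch of that kind of inspection in Python 3, where `ascii()` plays the role that `repr()` of a unicode string played in Python 2. The helper `describe` and the two-character sample are assumptions made for illustration.

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for each character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


# Hypothetical sample: two characters that look alike on screen.
sample = "\uff92\u30e1"
print(ascii(sample))
for code, name in describe(sample):
    print(code, name)
# U+FF92 HALFWIDTH KATAKANA LETTER ME
# U+30E1 KATAKANA LETTER ME
```

Listing the names makes look-alike characters easy to tell apart, which a terminal rendering usually does not.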

HTH,
Martin

Oct 23 '07 #3

Leo Kislov

On Oct 23, 3:37 am, kettle <Josef.Robert.No...@gmail.com> wrote:

[...]

So, what's the deal?

It's a good thing my crystal ball is working. I can see clearly that
the fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME'
(U+FF92), which is not present in ISO-2022-JP as defined by RFC 1468,
so Python converts it into a question mark as you requested. Meanwhile
Perl, as usual, tries to guess what you want and silently converts
that character into 'KATAKANA LETTER ME' (U+30E1), which is present in
ISO-2022-JP.
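
The difference between the two characters can be checked directly. This is a minimal Python 3 sketch; the helper name `try_encode` is an assumption for illustration.

```python
half = "\uff92"  # HALFWIDTH KATAKANA LETTER ME
full = "\u30e1"  # KATAKANA LETTER ME


def try_encode(ch):
    """Report whether a character is mappable by the iso-2022-jp codec."""
    try:
        ch.encode("iso-2022-jp")  # strict mode is the default
        return "ok"
    except UnicodeEncodeError:
        return "unmappable"


print(try_encode(half))                        # unmappable
print(try_encode(full))                        # ok
print(half.encode("iso-2022-jp", "replace"))   # b'?'
```

Encoding in strict mode first is a cheap way to find out which characters are being lost before reaching for 'replace'.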

Why can't python properly encode some of these
characters?

Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about the width of characters? What about
full-width " (U+FF02)? Python doesn't know the answers to these
questions, so it doesn't do anything with your input. You have to do
it yourself. Assuming you don't care about roundtripping and width,
here is an example demonstrating how to deal with narrow characters:

from unicodedata import normalize

# Python 2: map each code point in the range to its NFKC-normalized form
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.

Note, this is just an example; I *don't* claim it does what you want
for every character in the FF61-FFDF range. You may want to carefully
review the whole Unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf
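
For anyone on Python 3, the same idea might look like the sketch below. It makes several assumptions beyond the example above: the table is limited to U+FF61..U+FF9F (half-width katakana and punctuation, leaving out the half-width Hangul further up the block), an NFC pass is added so that a kana followed by a half-width voiced mark is composed into one character, and the names `HALFWIDTH_KANA` and `to_iso2022jp` are made up. The sample string is reconstructed from the description in this thread.

```python
import unicodedata

# Assumption: fold only half-width katakana and punctuation (U+FF61..U+FF9F).
HALFWIDTH_KANA = {
    cp: unicodedata.normalize("NFKC", chr(cp))
    for cp in range(0xFF61, 0xFFA0)
}


def to_iso2022jp(text):
    """Widen half-width kana, then encode strictly as ISO-2022-JP."""
    widened = text.translate(HALFWIDTH_KANA)
    # Per-character folding turns the half-width voiced marks into
    # combining marks; NFC composes them with the preceding kana.
    composed = unicodedata.normalize("NFC", widened)
    return composed.encode("iso-2022-jp")


# Reconstructed sample: half-width katakana and a half-width middle dot.
sample = "↓東京ﾒﾄﾛ日比谷線･北千住行"
encoded = to_iso2022jp(sample)
print(ascii(encoded.decode("iso-2022-jp")))
```

Encoding stays strict here, so anything the table does not cover still raises UnicodeEncodeError rather than turning into a question mark. Note that NFC can also rewrite a few other characters, such as CJK compatibility ideographs, so it is worth checking against real data.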

-- Leo.

Oct 24 '07 #4

kettle

Thanks Leo, and everyone else, these were very helpful replies. The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana; rather, I only need to be able to rely on any
particular kana character being translated correctly to its half-width
or full-width equivalent, and I need the Japanese I send out to be
readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists. In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe


Oct 24 '07 #5
