japanese encoding iso-2022-jp in python vs. perl

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's Encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var))
print $var
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters? I know there are a host of different iso-2022-jp
variants, could it be using a different one than I think (the
default)? I'm quite liking python at the moment for a variety of
different reasons (I suspect perl will forever win when it comes to
regular expressions but everything else is pretty darn nice), but this
is a bit worrying.

-Joe

Oct 23 '07 #1
On Behalf Of kettle
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--
Possibly silly question: Is that a utf-8 string, or Unicode?

print unicode(var, "utf8").encode("iso-2022-jp")

On my computer (Japanese XP), your string round-trips between utf-8 and
iso-2022-jp without problems.

Another possible thing to look at is whether your Python output terminal can
print Japanese OK. Does it choke when printing the string as Unicode?
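
In Python 3 terms the same distinction is bytes versus str. A minimal
sketch, assuming the input arrives as UTF-8 bytes (the `raw` value is a
stand-in, not the poster's actual data):

```python
# Stand-in for UTF-8 bytes read from a file or socket (assumed input).
raw = "↓東京メトロ日比谷線・北千住行".encode("utf-8")

# Decode to text first, then encode to the target charset.
text = raw.decode("utf-8")
jis = text.encode("iso-2022-jp")

# Print the escaped bytes so the result does not depend on the terminal.
print(repr(jis))
```

Printing the repr of the bytes sidesteps the separate question of
whether the terminal can display Japanese.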

Regards,
Ryan Ginstrom

Oct 23 '07 #2
var = var.encode("iso-2022-jp", "replace")
print var
[...]
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters?
It's not clear. As Ryan says, it works just fine (and so it does for
me with Python 2.4.4 on Debian).

What Python version are you using, and what is the precise string that
you want to encode? (use "print repr(var)" to report that exact value)
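
A sketch of that kind of inspection in Python 3, where `ascii()` plays
the role of Python 2's `repr()` for non-ASCII text. The helper name
`describe` and the sample string are illustrative assumptions:

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per character of text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

# Assumed sample; U+FF92 is written as an escape so it cannot be
# confused with its full-width look-alike.
sample = "↓東京\uff92"

print(ascii(sample))
for line in describe(sample):
    print(line)
```

Listing the character names makes half-width and full-width kana easy
to tell apart, which is hard to do by eye.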

HTH,
Martin
Oct 23 '07 #3
On Oct 23, 3:37 am, kettle <Josef.Robert.No...@gmail.com> wrote:
So, what's the deal?
Thanks that I have my crystal ball working. I can see clearly that the
fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME'
(U+FF92), which is not present in ISO-2022-JP as defined by RFC 1468,
so python converts it into a question mark as you requested. Meanwhile
perl, as usual, is trying to guess what you want and silently converts
that character into 'KATAKANA LETTER ME' (U+30E1), which is present in
ISO-2022-JP.
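
The difference can be reproduced for that one character. A minimal
Python 3 sketch; NFKC normalization is used here only as a stand-in for
whatever mapping Perl applies internally, which the thread does not
specify:

```python
import unicodedata

half = "\uff92"                             # HALFWIDTH KATAKANA LETTER ME
full = unicodedata.normalize("NFKC", half)  # compatibility mapping to full width

# With 'replace', the unencodable character becomes a question mark.
replaced = half.encode("iso-2022-jp", "replace")

# The full-width form is part of JIS X 0208, so strict encoding succeeds.
encoded = full.encode("iso-2022-jp")

print(unicodedata.name(full), replaced, encoded)
```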
Why can't python properly encode some of these
characters?
Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about width of characters? What about
full-width " (U+FF02)? Python doesn't know answers to these questions
so it doesn't do anything with your input. You have to do it yourself.
Assuming you don't care about roundtripping and width, here is an
example demonstrating how to deal with narrow characters:

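from unicodedata import normalize

# Python 2: map every code point in the halfwidth block (U+FF61..U+FFDF)
# to its NFKC (compatibility) normalization.
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))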

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.

Note, this is just an example, I *don't* claim it does what you want
for any character in FF61-FFDF range. You may want to carefully review
the whole unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf
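
For Python 3, where `unichr` and the `print` statement are gone, a
rough equivalent might look like the following. The `widen` helper and
the trailing NFC step are additions for illustration: a per-character
table turns the halfwidth voiced sound mark (U+FF9E) into a combining
mark, and NFC recombines it with the preceding kana.

```python
from unicodedata import normalize

# Per-character NFKC table for the halfwidth block U+FF61..U+FFDF.
HALFWIDTH_TO_FULLWIDTH = {cp: normalize("NFKC", chr(cp))
                          for cp in range(0xFF61, 0xFFE0)}

def widen(text):
    """Map halfwidth forms to full width, then compose with NFC."""
    return normalize("NFC", text.translate(HALFWIDTH_TO_FULLWIDTH))

# Assumed sample with halfwidth ME, TO, RO and the halfwidth middle dot.
text = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
encoded = widen(text).encode("iso-2022-jp")

print(repr(encoded))
```

This still carries the caveat above: the halfwidth Hangul part of the
block normalizes to characters that ISO-2022-JP cannot represent, so a
strict encode can still fail on other input.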

-- Leo.

Oct 24 '07 #4
Thanks Leo, and everyone else, these were very helpful replies. The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana; rather, I only need to be able to rely on any
particular kana character being translated correctly to its half-width
or full-width equivalent, and I need the Japanese I send out to be
readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists. In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.
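
Python does allow such a policy to be plugged in by hand through
`codecs.register_error`. A sketch in Python 3 syntax; the handler name
`nfkc_fallback` is invented here, and NFKC is one possible mapping, not
necessarily the one Perl uses:

```python
import codecs
import unicodedata

def nfkc_fallback(exc):
    """Error handler: retry unencodable characters in NFKC-normalized form."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    fixed = unicodedata.normalize("NFKC", bad)
    if fixed == bad:
        raise exc          # normalization did not help; give up
    return fixed, exc.end  # the codec encodes the replacement text itself

codecs.register_error("nfkc_fallback", nfkc_fallback)

# Assumed sample with halfwidth ME, TO, RO and the halfwidth middle dot.
text = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
encoded = text.encode("iso-2022-jp", "nfkc_fallback")

print(ascii(encoded.decode("iso-2022-jp")))
```

The handler is used the same way as 'replace', by passing its name as
the second argument to `encode`. If the normalized text is itself
unencodable, the encode call still raises.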

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe
Oct 24 '07 #5
