japanese encoding iso-2022-jp in python vs. perl

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var))
print $var
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters? I know there are a host of different iso-2022-jp
variants, could it be using a different one than I think (the
default)? I'm quite liking python at the moment for a variety of
different reasons (I suspect perl will forever win when it comes to
regular expressions but everything else is pretty darn nice), but this
is a bit worrying.

-Joe

Oct 23 '07 #1
On Behalf Of kettle
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--
Possibly silly question: Is that a utf-8 string, or Unicode?

print unicode(var, "utf8").encode("iso-2022-jp")

On my computer (Japanese XP), your string round-trips between utf-8 and
iso-2022-jp without problems.
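
For readers on Python 3, where bytes and text are separate types, the two-step conversion suggested here can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `utf8_to_iso2022jp` is made up, and the sample string uses full-width characters only, which is an assumption about what was tested.

```python
def utf8_to_iso2022jp(data: bytes) -> bytes:
    """Decode UTF-8 bytes to text, then encode that text as ISO-2022-JP."""
    text = data.decode("utf-8")        # bytes -> str; raises on invalid UTF-8
    return text.encode("iso2022_jp")   # str -> bytes; strict, so no silent '?'


# Assumed sample: full-width kana only, so every character is in JIS X 0208.
original = "↓東京\u30e1\u30c8\u30ed日比谷線\u30fb北千住行"
converted = utf8_to_iso2022jp(original.encode("utf-8"))
round_tripped = converted.decode("iso2022_jp")
print(round_tripped == original)
```

If the strict encode raises UnicodeEncodeError instead, the input contains characters outside the target character set, which is what the later replies in this thread narrow down.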

Another possible thing to look at is whether your Python output terminal can
print Japanese OK. Does it choke when printing the string as Unicode?

Regards,
Ryan Ginstrom

Oct 23 '07 #2
var = var.encode("iso-2022-jp", "replace")
print var
[...]
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters?
It's not clear. As Ryan says, it works just fine (and so it does for
me with Python 2.4.4 on Debian).

What Python version are you using, and what is the precise string that
you want to encode? (use "print repr(var)" to report that exact value)
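
Beyond repr(), it can help to list exactly which characters the codec refuses. The following is a small Python 3 sketch under stated assumptions: the function name `unencodable` is hypothetical, and the sample string is a guess at the kind of input involved (half-width katakana), not the poster's actual data.

```python
import unicodedata


def unencodable(text, encoding="iso2022_jp"):
    """Return (index, code point, Unicode name) for each character that
    cannot be encoded on its own in the given encoding."""
    report = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            name = unicodedata.name(ch, "<unnamed>")
            report.append((index, "U+%04X" % ord(ch), name))
    return report


# Assumed sample: half-width kana and a half-width middle dot among kanji.
sample = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
report = unencodable(sample)
for row in report:
    print(row)
```

Each row names a character to investigate, which is usually more informative than a run of question marks in the output.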

HTH,
Martin
Oct 23 '07 #3
On Oct 23, 3:37 am, kettle <Josef.Robert.No...@gmail.com> wrote:
Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var))
print $var
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal?
Thanks that I have my crystal ball working. I can see clearly that the
fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME'
(U+FF92), which is not present in ISO-2022-JP as defined by RFC 1468,
so python converts it into a question mark as you requested. Meanwhile
perl, as usual, is trying to guess what you want and silently converts
that character into 'KATAKANA LETTER ME' (U+30E1), which is present in
ISO-2022-JP.
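
The behaviour described above can be reproduced in Python 3 with a short sketch. The input string here is an assumed reconstruction: only the fourth character (U+FF92) is identified in this reply; the other half-width characters are guesses based on where the question marks appear in the reported output.

```python
# Assumed reconstruction of the input: half-width kana plus a half-width
# middle dot, none of which exist in the iso2022_jp codec's character sets.
text = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"

try:
    text.encode("iso2022_jp")          # default error mode is "strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With "replace", each unencodable character becomes a question mark.
replaced = text.encode("iso2022_jp", "replace")
visible = replaced.decode("iso2022_jp")
print(strict_failed, visible)
```

In other words, the question marks come from the 'replace' error mode that was asked for; without it the codec raises an error instead of guessing.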
Why can't python properly encode some of these
characters?
Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about width of characters? What about
full-width " (U+FF02)? Python doesn't know answers to these questions
so it doesn't do anything with your input. You have to do it yourself.
Assuming you don't care about roundtripping and width, here is an
example demonstrating how to deal with narrow characters:

# Python 2: map every code point in U+FF61..U+FFDF to its NFKC form.
from unicodedata import normalize
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.

Note, this is just an example, I *don't* claim it does what you want
for any character in FF61-FFDF range. You may want to carefully review
the whole unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf
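
For anyone reading this on Python 3, the same idea might look like the sketch below. It is a port under assumptions, not the original code: the names `HALFWIDTH_TO_NFKC` and `squeeze` are invented here, and the sample string is a guessed reconstruction of the input. The caveat above still applies: the table has not been reviewed character by character.

```python
from unicodedata import normalize

# Map every code point in U+FF61..U+FFDF to its NFKC form.
# Unassigned code points in the range normalize to themselves.
HALFWIDTH_TO_NFKC = {cp: normalize("NFKC", chr(cp))
                     for cp in range(0xFF61, 0xFFE0)}


def squeeze(text):
    """Replace half-width forms in the range with their NFKC equivalents."""
    return text.translate(HALFWIDTH_TO_NFKC)


# Assumed sample input containing half-width kana.
sample = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
squeezed = squeeze(sample)
encoded = squeezed.encode("iso2022_jp")   # strict: raises if anything is left over
print(squeezed)
```

Keeping the encode step strict means any character the table does not cover shows up as an exception and not as a silent question mark.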

-- Leo.

Oct 24 '07 #4
Thanks Leo, and everyone else, these were very helpful replies. The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana; rather, I only need to be able to rely on any
particular kana character being translated correctly to its half-width
or full-width equivalent, and I need the Japanese I send out to be
readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists. In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.
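
Python does let a program opt in to this kind of behaviour through codecs.register_error, which registers a named error handler usable alongside 'replace'. The following Python 3 sketch is one possible shape for such a handler, under assumptions: the handler name "nfkc_fallback" is hypothetical, the sample input is a guessed reconstruction, and the behaviour has only been reasoned through for the characters shown.

```python
import codecs
from unicodedata import normalize


def nfkc_fallback(exc):
    """Error handler: substitute the NFKC form of characters the codec rejects."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = normalize("NFKC", bad)
    if replacement == bad:
        # NFKC offers nothing different, so fall back to '?' like 'replace'.
        replacement = "?" * len(bad)
    # If the replacement is itself unencodable, the codec raises an error.
    return replacement, exc.end


# "nfkc_fallback" is a made-up name for this illustration.
codecs.register_error("nfkc_fallback", nfkc_fallback)

# Assumed sample input containing half-width kana.
text = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
encoded = text.encode("iso2022_jp", "nfkc_fallback")
print(encoded.decode("iso2022_jp"))
```

The conversion stays explicit this way: the caller has to name the handler, so nothing is widened silently.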

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.
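
One gap worth knowing about when completing the mapping: half-width voiced and semi-voiced sound marks (U+FF9E, U+FF9F) map to combining marks under per-character NFKC, so a kana such as half-width KA followed by a voiced mark does not become a single GA unless the result is composed afterwards. A possible Python 3 sketch, assuming a follow-up NFC pass is acceptable for the text in question (the names `HALFWIDTH_TO_NFKC` and `widen_kana` are invented for this illustration):

```python
from unicodedata import normalize

# Per-character NFKC table for U+FF61..U+FFDF.
HALFWIDTH_TO_NFKC = {cp: normalize("NFKC", chr(cp))
                     for cp in range(0xFF61, 0xFFE0)}


def widen_kana(text):
    """Widen half-width forms, then compose kana with their sound marks."""
    widened = text.translate(HALFWIDTH_TO_NFKC)
    # NFC merges base kana + combining voiced mark into one precomposed kana.
    # Note: NFC also affects any other decomposed sequences in the text.
    return normalize("NFC", widened)


# Half-width KA + voiced mark, I, TO + voiced mark.
halfwidth = "\uff76\uff9e\uff72\uff84\uff9e"
widened = widen_kana(halfwidth)
encoded = widened.encode("iso2022_jp")
print(widened)
```

Calling normalize("NFKC", text) on the whole string composes the marks too, but it also folds full-width ASCII and other compatibility characters, which may or may not be wanted.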

Again, thanks for the help.

-Joe
Oct 24 '07 #5
