japanese encoding iso-2022-jp in python vs. perl

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's Encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var));
print $var;
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters? I know there are a host of different iso-2022-jp
variants, could it be using a different one than I think (the
default)? I'm quite liking python at the moment for a variety of
different reasons (I suspect perl will forever win when it comes to
regular expressions but everything else is pretty darn nice), but this
is a bit worrying.

-Joe

Oct 23 '07 #1
kettle wrote:
> I am rather new to python, and am currently struggling with some
> encoding issues. I have some utf-8-encoded text which I need to
> encode as iso-2022-jp before sending it out to the world. I am using
> python's encode functions:
> --
> var = var.encode("iso-2022-jp", "replace")
> print var
> --

Possibly silly question: Is that a utf-8 string, or Unicode?

print unicode(var, "utf8").encode("iso-2022-jp")

On my computer (Japanese XP), your string round-trips between utf-8 and
iso-2022-jp without problems.
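
For anyone reading this on Python 3, where bytes and str are separate types, the same round trip might look roughly like the sketch below. The sample string is only a stand-in made of full-width characters; it is an assumption, not necessarily the poster's exact input.

```python
# Python 3 sketch: the UTF-8 input arrives as bytes and must be decoded first.
utf8_bytes = "東京メトロ日比谷線".encode("utf-8")  # stand-in for the real input

text = utf8_bytes.decode("utf-8")        # bytes -> str (Unicode)
jis_bytes = text.encode("iso-2022-jp")   # str -> ISO-2022-JP bytes

# Decoding the result should give back the original text.
round_tripped = jis_bytes.decode("iso-2022-jp")
print(round_tripped == text)
```

If the encode step raises UnicodeEncodeError instead, the input probably contains a character outside the ISO-2022-JP repertoire.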

Another possible thing to look at is whether your Python output terminal can
print Japanese OK. Does it choke when printing the string as Unicode?

Regards,
Ryan Ginstrom

Oct 23 '07 #2
kettle wrote:
> var = var.encode("iso-2022-jp", "replace")
> print var
> [...]
> ↓東京メトロ日比谷線・北千住行
>
> So, what's the deal? Why can't python properly encode some of these
> characters?

It's not clear. As Ryan says, it works just fine (and so it does for
me with Python 2.4.4 on Debian).

What Python version are you using, and what is the precise string that
you want to encode? (use "print repr(var)" to report that exact value)
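
Beyond repr, listing each character's code point and Unicode name can make look-alike characters easy to spot. Below is a minimal Python 3 sketch; `describe` is a hypothetical helper, and the sample input (two kanji followed by a half-width katakana) is an assumption chosen for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<no name>")) for ch in text]

# Hypothetical input: full-width kanji followed by a half-width katakana.
for code, name in describe("東京\uFF92"):
    print(code, name)
```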

HTH,
Martin
Oct 23 '07 #3
On Oct 23, 3:37 am, kettle <Josef.Robert.No...@gmail.com> wrote:
> Hi,
> I am rather new to python, and am currently struggling with some
> encoding issues. I have some utf-8-encoded text which I need to
> encode as iso-2022-jp before sending it out to the world. I am using
> python's encode functions:
> --
> var = var.encode("iso-2022-jp", "replace")
> print var
> --
>
> I am using the 'replace' argument because there seem to be a couple
> of utf-8 japanese characters which python can't correctly convert to
> iso-2022-jp. The output looks like this:
> ↓東京???日比谷線?北千住行
>
> However, if I use perl's Encode module to re-encode the exact same bit
> of text:
> --
> $var = encode("iso-2022-jp", decode("utf8", $var));
> print $var;
> --
>
> I get proper output (no unsightly question-marks):
> ↓東京メトロ日比谷線・北千住行
>
> So, what's the deal?

Luckily, my crystal ball is working. I can see clearly that the fourth
character of the input is 'HALFWIDTH KATAKANA LETTER ME' (U+FF92), which
is not present in ISO-2022-JP as defined by RFC 1468, so python converts
it into a question mark, as you requested. Meanwhile perl, as usual,
tries to guess what you want and silently converts that character into
'KATAKANA LETTER ME' (U+30E1), which is present in ISO-2022-JP.

> Why can't python properly encode some of these
> characters?

Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about the width of characters? What about
full-width " (U+FF02)? Python doesn't know the answers to these
questions, so it doesn't do anything with your input. You have to do it
yourself. Assuming you don't care about roundtripping and width, here is
an example demonstrating how to deal with narrow characters:

from unicodedata import normalize

# Map every code point in U+FF61..U+FFDF to its NFKC-normalized form.
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.
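
The snippet above is Python 2. A rough Python 3 equivalent, combined with the encode step from the original question, might look like the following sketch. The names `squeeze` and `encode_iso2022jp` are made up for illustration.

```python
from unicodedata import normalize

# Same table as above, written for Python 3 (chr instead of unichr).
squeeze = {cp: normalize("NFKC", chr(cp)) for cp in range(0xFF61, 0xFFE0)}

def encode_iso2022jp(text):
    """Widen half-width forms, then encode; anything still unmappable becomes '?'."""
    return text.translate(squeeze).encode("iso-2022-jp", "replace")

half = "\uFF92"                                       # HALFWIDTH KATAKANA LETTER ME
print(half.encode("iso-2022-jp", "replace"))          # replaced, as in the original report
print(encode_iso2022jp(half).decode("iso-2022-jp"))   # widened before encoding
```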

Note, this is just an example; I *don't* claim it does what you want for
any character in the FF61-FFDF range. You may want to carefully review
the whole Unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf
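
One case in that block that seems worth reviewing is the half-width voiced and semi-voiced sound marks (U+FF9E, U+FF9F). They are separate code points, so a per-code-point table cannot combine them with the preceding kana. A possible alternative, sketched here in Python 3 with the hypothetical helper `widen_kana`, is to NFKC-normalize whole runs of half-width characters and leave the rest of the text alone.

```python
import re
from unicodedata import normalize

# U+FF61..U+FF9F: half-width CJK punctuation and katakana, including the
# half-width voiced (U+FF9E) and semi-voiced (U+FF9F) sound marks.
HALFWIDTH_RUN = re.compile("[\uFF61-\uFF9F]+")

def widen_kana(text):
    """NFKC-normalize each run of half-width kana, leaving other text alone."""
    return HALFWIDTH_RUN.sub(lambda m: normalize("NFKC", m.group()), text)

ba = "\uFF8A\uFF9E"                    # half-width HA + voiced sound mark
print(ascii(widen_kana(ba)))           # normalized as a run, so the mark can compose
print(ascii(widen_kana("\uFF21")))     # full-width A is outside the range, untouched
```

Restricting the normalization to this range is a design choice: NFKC over the whole string would also rewrite other compatibility characters, such as full-width Latin letters, which may not be wanted.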

-- Leo.

Oct 24 '07 #4
Thanks Leo, and everyone else, these were very helpful replies. The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana; rather, I only need to be able to rely on any particular
kana character being translated correctly to its half-width or full-width
equivalent, and I need the Japanese I send out to be readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists. In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.
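
Python does let you register your own error handler alongside 'replace', so something close to this can be built by hand. The following is a Python 3 sketch only: the handler name "nfkc_replace" is made up and is not something Python ships with, and it assumes the codec accepts a str replacement from a custom handler.

```python
import codecs
from unicodedata import normalize

def nfkc_replace(exc):
    """Encoding error handler: try the NFKC form, fall back to '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    candidate = normalize("NFKC", bad)
    try:
        candidate.encode(exc.encoding)   # check that the widened form is encodable
    except UnicodeEncodeError:
        candidate = "?" * len(bad)       # same fallback as the built-in 'replace'
    return candidate, exc.end

# "nfkc_replace" is a hypothetical name chosen for this example.
codecs.register_error("nfkc_replace", nfkc_replace)

encoded = "東京\uFF92".encode("iso-2022-jp", "nfkc_replace")
print(encoded.decode("iso-2022-jp"))
```

Because the handler sees one failing span at a time, it would not combine a half-width voiced sound mark with the kana before it; pre-processing the whole string is likely the more robust route.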

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe

Oct 24 '07 #5
