Question concerning Unicode and/or Shift-JIS

OK, so I'm a newbie Python programmer and I'm trying to create a simple Python web application. The program is simply going to read in pairs of words, parse them into a dictionary file, then randomly display a key and prompt the user for the correct answer. Basically, it's a digital flash card system with a modular "dictionary" file.

The problem is this: I'm trying to create this program to help me study foreign languages (specifically Japanese at the moment), and when I save the text file which houses the word pairs, it is automatically encoded into UTF. However, when getting user input, the input is natively sent to the program in Shift-JIS encoding. I downloaded CJKcodecs for Python to encode a string into any number of Japanese encodings; however, the problem is I don't know how to "decode" the UTF and then recode it into Shift-JIS so that I can compare the dictionary values with the input values. Or I could convert the input from Shift-JIS to UTF, but either way I don't know how to decode any of the codecs. I'm sure there's just some simple function call, but I have been unable to find it.

Any help would be appreciated!
Thanks =)
Jul 18 '05 #1
Antioch:
however the problem is I don't know how to "decode" the UTF and then
recode it into Shift-JIS so that I can compare the dictionary values with
the input values.


I don't have a Shift-JIS codec installed so this breaks but should work
if you have the codec installed:

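# Python 2: a byte string holding UTF-8 encoded text
y = '\xe3\x81\x8b\xe3\x82\x8f\xe3\x81\x95\xe3\x81\x8d'
print y
print repr(y)
u = unicode(y, "utf-8")      # decode the bytes into a unicode object
print repr(u)
s = u.encode("shift-jis")    # re-encode the text as Shift-JIS bytes
print s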
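
For readers on Python 3, where `str` is already Unicode text and the `unicode()` built-in no longer exists, the same conversion might be sketched as follows. This is an illustration and not part of the original reply; it assumes the `shift_jis` codec bundled with the standard library, so no separate CJKcodecs install should be needed.

```python
# Python 3 sketch: the same UTF-8 bytes as in the snippet above.
utf8_bytes = b'\xe3\x81\x8b\xe3\x82\x8f\xe3\x81\x95\xe3\x81\x8d'

# Decode the UTF-8 bytes into a str (Unicode text).
text = utf8_bytes.decode("utf-8")
# ascii() shows the code points as escapes, much like repr() did in Python 2.
print(ascii(text))

# Re-encode the text as Shift-JIS bytes.
sjis_bytes = text.encode("shift_jis")
print(repr(sjis_bytes))

# Decoding the Shift-JIS bytes gives back the same text.
round_trip = sjis_bytes.decode("shift_jis")
print(round_trip == text)
```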

Neil
Jul 18 '05 #2
Antioch wrote:
OK, so I'm a newbie Python programmer and I'm trying to create a simple Python
web application. The program is simply going to read in pairs of words, parse
them into a dictionary file, then randomly display a key and prompt the
user for the correct answer. Basically, it's a digital flash card system with
a modular "dictionary" file.

The problem is this: I'm trying to create this program to help me study
foreign languages (specifically Japanese at the moment), and when I save the
text file which houses the word pairs, it is automatically encoded into UTF.
However, when getting user input, the input is natively sent to the program
in Shift-JIS encoding. I downloaded CJKcodecs for Python to encode a string
into any number of Japanese encodings; however, the problem is I don't know how
to "decode" the UTF and then recode it into Shift-JIS so that I can compare
the dictionary values with the input values. Or I could convert the input
from Shift-JIS to UTF, but either way I don't know how to decode any of the
codecs. I'm sure there's just some simple function call, but I have been
unable to find it.

Any help would be appreciated! Thanks =)


If s is a string encoded in UTF-8, converting it to Shift-JIS will be something
like:

s2 = unicode(s, 'utf-8').encode('shift-jis')

For the reverse:

s = unicode(s2, 'shift-jis').encode('utf-8')

You have to make sure s contains only characters that Shift-JIS can represent,
or the encoding / decoding to / from Shift-JIS will fail and you'll get a
UnicodeError exception (a subclass of ValueError).
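
A minimal Python 3 sketch of guarding that conversion is shown here. The helper name `to_shift_jis` and the sample characters are made up for illustration; the only library behaviour relied on is that `str.encode` raises `UnicodeEncodeError` for unrepresentable characters and that this exception derives from `ValueError`.

```python
def to_shift_jis(text):
    """Return text encoded as Shift-JIS, or None if it cannot be encoded.

    Hypothetical helper, for illustration only.
    """
    try:
        return text.encode("shift_jis")
    except UnicodeEncodeError:
        # Raised when a character has no Shift-JIS representation.
        return None


# Hiragana is representable in Shift-JIS, so this yields bytes.
ok = to_shift_jis("\u304b\u308f")   # "kawa" in hiragana

# A Hangul syllable is not part of Shift-JIS, so this yields None.
bad = to_shift_jis("\ud55c")

# UnicodeEncodeError is a ValueError subclass, so "except ValueError"
# would catch it as well.
caught_as_value_error = issubclass(UnicodeEncodeError, ValueError)
```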

For further details, see the unicode function @
http://www.python.org/doc/current/li...cs.html#l2h-71 , the decode
and encode methods on strings @
http://www.python.org/doc/current/li...g-methods.html and the codecs module
@ http://www.python.org/doc/current/li...le-codecs.html

HTH
--
- Eric Brunel <eric (underscore) brunel (at) despammed (dot) com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com

Jul 18 '05 #3
Antioch <se************@SPAMTRAP.hotmail.com> wrote:
However, when getting user input, the input is natively sent to the
program in Shift-JIS encoding.
You can change this by setting the encoding of the web page containing
the <form>, either by having the server send an HTTP response header
'Content-Type: text/html;charset=utf-8', or by including a
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element inside the <head>.

If you don't set the charset like this, the browser will have to guess
what encoding to use, which in the absence of any properly-encoded
Japanese text on the form page could be anything, and may default to
different encodings dependent on the user's locale.
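
As a rough sketch of declaring the charset both ways at once, here is a minimal Python 3 WSGI application. WSGI itself, the form markup, the `/answer` path and the `answer` field name are all assumptions made for the example; the thread does not say how the page is actually served.

```python
FORM_PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>Flash cards</title>
</head>
<body>
<form method="post" action="/answer">
<input type="text" name="answer" />
<input type="submit" value="Check" />
</form>
</body>
</html>
"""


def application(environ, start_response):
    """Minimal WSGI app that serves the form page declared as UTF-8."""
    body = FORM_PAGE.encode("utf-8")
    headers = [
        # The charset parameter tells the browser how the page is encoded.
        ("Content-Type", "text/html;charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` should be enough.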

[Detour:]

The browser will send the form submission in the specified encoding,
*unless* the user deliberately goes to the browser's encodings menu
and selects a different one. Unlikely, but possible. The 'proper' way
around this is to write <form accept-charset="utf-8">, which should mean
that the browser should send the submission as UTF-8 regardless of the
encoding of the page containing the form. Unfortunately Internet
Explorer on Windows is broken and stupid, and prefers to use this as
a 'backup' encoding: it will use the current page's encoding for fields
which can be encoded in that, and the accept-charset encoding on
fields that contain characters that can't be encoded in the current
page's charset. Thus you can get a mixture of encodings with absolutely
no way to determine which is which.

The IE-compatible but utterly hideous workaround is to avoid
accept-charset and include a hidden field with name '_charset_' in the
form. IE will fill it in with the currently selected encoding when the
form is submitted. Whether it is worth doing this is debatable.
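
On the server side, using the '_charset_' value might look like the Python 3 sketch below. The helper `decode_field` is hypothetical, and it assumes two things the thread does not confirm: that the raw, still-encoded bytes of the field are available, and that the name the browser reports is one Python's codec registry recognises.

```python
import codecs


def decode_field(raw, declared_charset, default="utf-8"):
    """Decode raw form bytes using the charset reported in '_charset_'.

    Falls back to `default` when the reported name is missing or unknown.
    Hypothetical helper, for illustration only.
    """
    charset = default
    if declared_charset:
        try:
            # codecs.lookup raises LookupError for unknown encoding names.
            charset = codecs.lookup(declared_charset).name
        except LookupError:
            charset = default
    # May still raise UnicodeDecodeError if the bytes do not match the charset.
    return raw.decode(charset)


# Example: a field submitted as Shift-JIS with '_charset_' set to "Shift_JIS".
submitted = "\u304b\u308f".encode("shift_jis")
print(ascii(decode_field(submitted, "Shift_JIS")))
```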

[end detour.]
the problem is I don't know how to "decode" the UTF and then recode
it into Shift-JIS so that I can compare the dictionary values
I'd definitely recommend storing the dictionary values as Unicode strings
rather than trying to compare encoded versions.
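
A small Python 3 sketch of that approach: decode everything to text once, at the edges, and compare text with text. The tab-separated file layout and the function names `load_pairs` and `is_correct` are assumptions for the example, not details given in the thread.

```python
def load_pairs(utf8_data):
    """Parse UTF-8 bytes of 'question<TAB>answer' lines into a dict of str."""
    pairs = {}
    for line in utf8_data.decode("utf-8").splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        question, answer = line.split("\t", 1)
        pairs[question.strip()] = answer.strip()
    return pairs


def is_correct(pairs, question, raw_answer, input_encoding="shift_jis"):
    """Decode the submitted bytes, then compare text with text."""
    answer = raw_answer.decode(input_encoding).strip()
    return pairs.get(question) == answer


# Sample data: one pair stored as UTF-8, with the answer typed in as Shift-JIS.
data = "river\t\u304b\u308f\n".encode("utf-8")
pairs = load_pairs(data)
print(is_correct(pairs, "river", "\u304b\u308f".encode("shift_jis")))
```

Because both sides are decoded before the comparison, the encoding of the file and the encoding of the form submission no longer need to match.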
I don't know how to decode any of the codecs. I'm sure there's just
some simple function call


Yep:

characterString = unicode(jisString, 'shift_jis')
utfString = characterString.encode('utf-8')

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #4
