
Any way to designate the encoding of the "source" for compile?

Python's InteractiveInterpreter uses the built-in compile function.

According to the reference manual, it doesn't seem to take the
encoding of the source string into account.

When I pass in a unicode object, it is encoded in UTF-8 automatically.
This can be a problem when I'm building an interactive environment using
"compile" with an encoding other than UTF-8. IDLE itself has the
same problem. ('<a string with non-ascii-encoding>' is handled correctly,
but u'<a string with non-ascii-encoding>' is handled incorrectly.)

Any suggestions, or any plans for future Python versions?

Jul 19 '05 #1
8 Replies


I've read a posting from Martin von Loewis that mentioned trying to build
in that feature (optionally specifying the encoding when calling "compile").
Does anyone know how that is going?

Jul 19 '05 #2

On 16 May 2005 10:15:22 -0700, ja***********@hotmail.com wrote:
> ja***********@hotmail.com wrote:
> > Python's InteractiveInterpreter uses the built-in compile function.
> >
> > According to the ref. manual, it doesn't seem to concern about the
> > encoding of the source string.
> >
> > When I hand in an unicode object, it is encoded in utf-8 automatically.
> > It can be a problem when I'm building an interactive environment using
> > "compile", with a different encoding from utf-8.

I don't understand this. Suppose your "different encoding" is cp125x
(where x is a digit). Would you not do something like this?

compile_input = user_input.decode('cp125x')
code_object = compile(compile_input, ......
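
A minimal sketch of the decode-first idea above, written for Python 3 (where `str` is already Unicode) rather than the Python 2.4 used in this thread. The helper name `compile_user_input` and the `"<input>"` filename are made up for illustration, and `cp1252` merely stands in for the unspecified "cp125x".

```python
def compile_user_input(raw: bytes, encoding: str):
    """Decode raw user input with an explicit codec, then compile the text."""
    text = raw.decode(encoding)
    return compile(text, "<input>", "single")


# Simulate input that arrived as cp1252-encoded bytes.
raw_input_bytes = "greeting = 'café'".encode("cp1252")

namespace = {}
exec(compile_user_input(raw_input_bytes, "cp1252"), namespace)
```

The point of the design is that the codec is chosen by the caller at the boundary, so compile() only ever sees text.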

> > IDLE itself has the
> > same problem. ( '<a string with non-ascii-encoding>' is treated okay
> > but u'<a string with non-ascii-encoding>' is treated wrong.)
> >
> > Any suggestions or any plans in future python versions?
>
> I've read a posting from Martin Von Loewis mentioning trying to build
> in that feature(optionally marking encoding when calling "compile").
> Anyone knows how it is going on?


Firstly, it would help those who might be trying to help you if you
could post a simple example: input, output, what error message, what
you mean by 'is treated wrong' ... and when it comes to Unicode
objects (indeed any text), show us repr(text) -- "what you see is
often not what others see and often not what you've actually got".

Secondly, are any of the contents of PEP 263 of any use to you?
http://www.python.org/peps/pep-0263.html
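
As a rough illustration of how a PEP 263 declaration can carry the encoding into compile(), here is a sketch for Python 3, where compile() accepts a bytes source and honors a coding line in it. The helper name `compile_with_declared_encoding` is hypothetical, and nothing here is a claim about how Python 2.4 behaves.

```python
def compile_with_declared_encoding(raw: bytes, encoding: str):
    """Prepend a PEP 263 coding line so compile() decodes `raw` itself."""
    cookie = ("# -*- coding: %s -*-\n" % encoding).encode("ascii")
    return compile(cookie + raw, "<input>", "exec")


# EUC-KR bytes for the source text: inside = '한글'
raw = b"inside = '\xc7\xd1\xb1\xdb'"

namespace = {}
exec(compile_with_declared_encoding(raw, "euc-kr"), namespace)
```

One side effect of this approach is that reported line numbers are shifted by one because of the prepended line.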


Jul 19 '05 #3

John Machin wrote:
> [snip]
>
> Firstly, it would help those who might be trying to help you if you
> could post a simple example: input, output, what error message, what
> you mean by 'is treated wrong' ... and when it comes to Unicode
> objects (indeed any text), show us repr(text) -- "what you see is
> often not what others see and often not what you've actually got".
>
> Secondly, are any of the contents of PEP 263 of any use to you?
> http://www.python.org/peps/pep-0263.html

Okay, I'll use one of the CJK codecs as the example. EUC-KR is the
default encoding.

>>> import sys;sys.getdefaultencoding()
'euc-kr'
>>> '한글'
'\xc7\xd1\xb1\xdb'
>>> u'한글'
u'\ud55c\uae00'
>>> s=compile("inside=u'한글'",'','single')
>>> exec s
>>> inside #wrong
u'\xc7\xd1\xb1\xdb'
>>> s=compile(u"inside=u'한글'",'','single')
>>> exec s
>>> inside #correct
u'\ud55c\uae00'

So I reckon that the "compile" should get a unicode object. However...

C:\Python24\Lib>python code.py
<string>(1)?()
(Pdb) c
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> '한글'
'\xc7\xd1\xb1\xdb'
>>> u'한글' #wrong.. should be u'\ud55c\uae00' instead
u'\xc7\xd1\xb1\xdb'
>>> import sys;sys.getdefaultencoding()
'euc-kr'
>>> ^Z


Am I right in assuming that the problem lies in code.py (and therefore
in codeop.py)? To correct it, it seems I would have to parse each string and
rewrite the unicode literals... Hmm... That sounds like a bad approach.
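
One alternative to patching literals is to decode the whole input line before it reaches the interpreter class from code.py. The sketch below uses Python 3's `code.InteractiveInterpreter`; the subclass `DecodingInterpreter` and its `run_bytes` method are hypothetical names, and this is not how code.py itself handles input.

```python
import code


class DecodingInterpreter(code.InteractiveInterpreter):
    """Hypothetical wrapper: decode raw bytes before calling runsource()."""

    def __init__(self, encoding, locals=None):
        super().__init__(locals)
        self.encoding = encoding

    def run_bytes(self, raw: bytes) -> bool:
        # runsource() returns True when the statement is still incomplete.
        return self.runsource(raw.decode(self.encoding))


interp = DecodingInterpreter("euc-kr")
# EUC-KR bytes for the source text: inside = '한글'
needs_more = interp.run_bytes(b"inside = '\xc7\xd1\xb1\xdb'")
```

After the call, the interpreter's namespace is available as `interp.locals`.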

Jul 19 '05 #4

Oh, I forgot one more thing.

C:\Python24\Lib>python
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s=compile(u"'한글'",'','single')
>>> exec s #wrong. the result is encoded in utf-8 instead of euc-kr
'\xed\x95\x9c\xea\xb8\x80'
>>> s=compile(u"u'한글'",'','single')
>>> exec s #correct
u'\ud55c\uae00'
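
To make the two byte sequences in this thread easier to compare, here is a small Python 3 check (a sketch, not part of the original session) that encodes the same two characters with both codecs.

```python
text = "\ud55c\uae00"  # the two Hangul syllables used in the examples

utf8_bytes = text.encode("utf-8")
euckr_bytes = text.encode("euc-kr")

print(utf8_bytes)   # b'\xed\x95\x9c\xea\xb8\x80'
print(euckr_bytes)  # b'\xc7\xd1\xb1\xdb'
```

The six-byte result matches the "wrong" output above, which is consistent with the unicode source having been encoded to UTF-8 before the byte-string literal was read.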


Jul 19 '05 #5

On 16 May 2005 16:44:30 -0700, ja***********@hotmail.com wrote:

> janeaustin...@hotmail.com wrote:
> > John Machin wrote:
> > > On 16 May 2005 10:15:22 -0700, ja***********@hotmail.com wrote:
> > > > ja***********@hotmail.com wrote:
> > > > > Python's InteractiveInterpreter uses the built-in compile function.
> > > > >
> > > > > According to the ref. manual, it doesn't seem to concern about the
> > > > > encoding of the source string.
> > > > >
> > > > > When I hand in an unicode object, it is encoded in utf-8 automatically.
> > > > > It can be a problem when I'm building an interactive environment using
> > > > > "compile", with a different encoding from utf-8.

==== This is *EXACTLY* what your problem is ====
> > > I don't understand this. Suppose your "different encoding" is cp125x
> > > (where x is a digit). Would you not do something like this?
> > >
> > > compile_input = user_input.decode('cp125x')
> > > code_object = compile(compile_input, ......
=================================================

==== It would have helped had you followed this ====
> > > and when it comes to Unicode
> > > objects (indeed any text), show us repr(text) -- "what you see is
> > > often not what others see and often not what you've actually got".
====================================================

> > Okay, I'll use one of the CJK codecs as the example. EUC-KR is the
> > default encoding.
> > >>> import sys;sys.getdefaultencoding()
> > 'euc-kr'
> > >>> '한글'

# There's a very strong assumption that the above was originally
# encoded in euc-kr but by the time I copied the 2 chars out of my
# browser it was definitely Unicode. See what I mean about using repr()?

> > '\xc7\xd1\xb1\xdb'
> > >>> u'한글'
> > u'\ud55c\uae00'
> > >>> s=compile("inside=u'한글'",'','single')
> > >>> exec s
> > >>> inside #wrong


[big snip]

Like I said, *ALL* you have to do (like in any other Unicode-aware
app) is decode your user input into Unicode (you *don't* need to parse
bits and pieces of it) and feed it in ... like this:
>>> user_input_kr = "inside=u'\xc7\xd1\xb1\xdb'"
>>> user_input_uc = user_input_kr.decode('euc-kr')
>>> user_input_uc
u"inside=u'\ud55c\uae00'"
>>> s = compile(user_input_uc, '', 'single')
>>> exec s
>>> inside
u'\ud55c\uae00' # right


HTH,
John

Jul 19 '05 #6

John Machin wrote:
> On 16 May 2005 16:44:30 -0700, ja***********@hotmail.com wrote:
> [snip]
>
> Like I said, *ALL* you have to do (like in any other Unicode-aware
> app) is decode your user input into Unicode (you *don't* need to parse
> bits and pieces of it) and feed it in ... like this:
> >>> user_input_kr = "inside=u'\xc7\xd1\xb1\xdb'"
> >>> user_input_uc = user_input_kr.decode('euc-kr')
> >>> user_input_uc
> u"inside=u'\ud55c\uae00'"
> >>> s = compile(user_input_uc, '', 'single')
> >>> exec s
> >>> inside
> u'\ud55c\uae00' # right
>
> HTH,
> John

Thank you but there is still a problem.

|>>> s='euckr="\xc7\xd1";uni=u"\xc7\xd1"'
|>>> su=s.decode('euc-kr')
|>>> su
|u'euckr="\ud55c";uni=u"\ud55c"'
|>>> c=compile(su,'','single')
|>>> exec c
|>>> euckr,uni
|('\xed\x95\x9c', u'\ud55c')
|>>>

As you can see, the value of the plain (byte) string has been turned into UTF-8.

Jul 19 '05 #7

janeaustin...@hotmail.com wrote:
> Okay, I'll use one of the CJK codecs as the example. EUC-KR is the
> default encoding.
> >>> import sys;sys.getdefaultencoding()
> 'euc-kr'
> >>> '한글'
> '\xc7\xd1\xb1\xdb'


That is the problem. Non-ascii characters in byte strings are
deprecated. Here is what I get when I run a deprecated hello
world program in Russian:
------- hello.py ---------
print "Здравствуй, мир!"
--------------------------
C:\py>c:\Python24\python.exe hello.py
sys:1: DeprecationWarning: Non-ASCII character '\xc7' in file
text.py on line 1, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details
╟фЁртёЄтєщ, ьшЁ!
--------------------------
Oops, not only is there a warning, but it doesn't even work
on Windows in a Russian locale. To correct the program I need
to switch to unicode strings:
------- hello.py ---------
# -*- coding: windows-1251 -*-
print u"Здравствуй, мир!"
--------------------------
C:\py>c:\Python24\python.exe hello.py
Здравствуй, мир!
--------------------------

Since non-ascii characters are deprecated in byte strings,
any non-ascii encoding for sys.getdefaultencoding() is
deprecated as well. Don't set it to 'euc-kr'.

> Any suggestions or any plans in future python versions?


In python 3.0 byte strings will be gone. So you won't be
able to put non-ascii characters into them.
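
For reference, a short sketch of how this looks in Python 3 as released: plain string literals are Unicode text, a separate `bytes` type still exists, and a bytes literal containing a non-ASCII character is rejected at compile time. The variable names are illustrative only.

```python
# Plain literals are Unicode text in Python 3.
text = "한글"
assert isinstance(text, str) and len(text) == 2

# A bytes literal may only contain ASCII characters, so this source is rejected.
try:
    compile("data = b'한글'", "<input>", "exec")
    bytes_literal_rejected = False
except SyntaxError:
    bytes_literal_rejected = True

# Non-ASCII bytes have to be written with escapes or produced by encoding text.
encoded = text.encode("euc-kr")
```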
Serge.

Jul 19 '05 #8

janeaustin...@hotmail.com wrote:
> Thank you but there is still a problem.
>
> |>>> s='euckr="\xc7\xd1";uni=u"\xc7\xd1"'
> |>>> su=s.decode('euc-kr')
> |>>> su
> |u'euckr="\ud55c";uni=u"\ud55c"'

su[7] is a non-ascii character inside the byte string euckr

> |>>> c=compile(su,'','single')
> |>>> exec c
> |>>> euckr,uni
> |('\xed\x95\x9c', u'\ud55c')
> |>>>
>
> As you see the single's result is turned into UTF-8 encoding.


See my previous message. Non-ascii characters in byte strings
are deprecated.

Serge.

Jul 19 '05 #9
