#1
August 13th, 2008, 09:25 AM
kettle
python tr equivalent (non-ascii)

Hi,
I was wondering how I ought to be handling character range
translations in python.

What I want to do is translate fullwidth numbers and roman alphabet
characters into their halfwidth ascii equivalents.
In perl I can do this pretty easily with tr:

tr/\x{ff00}-\x{ff5e}/\x{0020}-\x{007e}/;

and I think the string.translate method is what I need to use to
achieve the equivalent in python. Unfortunately the maketrans function
doesn't seem to accept character ranges and I'm also having trouble
with its interpretation of length. What I came up with was to first
fudge the ranges:

my_test_string = u"ＡＢＣＤＥＦＧ"
f_range = "".join([unichr(x) for x in range(ord(u"\uff00"), ord(u"\uff5e"))])
t_range = "".join([unichr(x) for x in range(ord(u"\u0020"), ord(u"\u007e"))])

then use these as input to maketrans:
my_trans_string = my_test_string.translate(string.maketrans(f_range, t_range))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-93: ordinal not in range(128)

but it generates an encoding error... and if I encode the ranges in
utf8 before passing them on I get a length error, because maketrans is
counting bytes, not characters, and utf8 is variable width...
my_trans_string = my_test_string.translate(string.maketrans(f_range.encode("utf8"), t_range.encode("utf8")))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: maketrans arguments must have same length
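
The length mismatch described above can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax (an assumption, since the thread itself uses Python 2), with the ranges adjusted to the assigned fullwidth forms U+FF01 to U+FF5E and their ASCII counterparts 0x21 to 0x7E. It shows that the two strings have the same number of characters but a different number of UTF-8 bytes, which is why a byte-oriented maketrans rejects them.

```python
# Sketch in Python 3; the variable names mirror the post above.
# Ranges are an assumption: U+FF01..U+FF5E are the assigned fullwidth
# forms, and 0x21..0x7E are the matching printable ASCII characters.
f_range = "".join(chr(x) for x in range(0xFF01, 0xFF5F))
t_range = "".join(chr(x) for x in range(0x21, 0x7F))

# Same number of characters on both sides.
print(len(f_range), len(t_range))   # 94 94

# After UTF-8 encoding, each fullwidth character takes 3 bytes,
# while each ASCII character takes 1 byte.
f_bytes = f_range.encode("utf-8")
t_bytes = t_range.encode("utf-8")
print(len(f_bytes), len(t_bytes))   # 282 94
```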
#2
August 13th, 2008, 09:35 AM
kettle
Re: python tr equivalent (non-ascii)

Ok so I guess I was barking up the wrong tree. Searching for python 全角
　半角 quickly brought up a solution:
>>> import unicodedata
>>> my_test_string = u"フガホゲ-%*@AＢＣ－％＊＠１２3"
>>> print unicodedata.normalize('NFKC', my_test_string)
フガホゲ-%*@ABC-%*@123
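
For readers on Python 3, the same normalization can be sketched as below. The helper name `to_halfwidth_nfkc` is hypothetical. Note that NFKC is broader than a pure fullwidth-to-ASCII mapping: it applies every compatibility mapping Unicode defines, so, for example, halfwidth katakana are widened at the same time.

```python
import unicodedata


def to_halfwidth_nfkc(text):
    """Apply NFKC normalization (hypothetical helper name).

    NFKC folds fullwidth Latin letters and digits to ASCII, but it also
    applies all other compatibility mappings.
    """
    return unicodedata.normalize("NFKC", text)


print(to_halfwidth_nfkc("\uff21\uff22\uff23\uff11\uff12\uff13"))  # ABC123

# Side effect: halfwidth katakana KA (U+FF76) becomes regular KA (U+30AB).
print(to_halfwidth_nfkc("\uff76") == "\u30ab")  # True
```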
still, it would be nice if there were a more general solution, or if
maketrans actually looked at chars instead of bytes, methinks.
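
In Python 3 (not available to this thread's Python 2 setup), `str.maketrans` does work on characters rather than bytes, so a tr-style table can be built from two equal-length strings. The sketch below assumes the ranges U+FF01 to U+FF5E and 0x21 to 0x7E; the names `FULLWIDTH`, `HALFWIDTH`, and `TABLE` are illustrative only.

```python
# Python 3 sketch: str.maketrans accepts two equal-length str arguments
# and returns a dict mapping code points to code points.
FULLWIDTH = "".join(chr(c) for c in range(0xFF01, 0xFF5F))
HALFWIDTH = "".join(chr(c) for c in range(0x21, 0x7F))
TABLE = str.maketrans(FULLWIDTH, HALFWIDTH)

# The ideographic space U+3000 sits outside the fullwidth block,
# so it is added to the table separately.
TABLE[0x3000] = 0x20

# Fullwidth "Python", an ideographic space, and a fullwidth "3".
sample = "\uff30\uff59\uff54\uff48\uff4f\uff4e\u3000\uff13"
print(sample.translate(TABLE))  # Python 3
```

Unlike NFKC, this table touches only the listed characters and leaves everything else (katakana, for instance) unchanged.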


#3
August 13th, 2008, 09:35 AM
Fredrik Lundh
Re: python tr equivalent (non-ascii)

maketrans only works for byte strings.

as for translate itself, it has different signatures for byte strings
and unicode strings; in the former case, it takes a lookup table
represented as a 256-byte string (e.g. created by maketrans), in the
latter case, it takes a dictionary mapping from ordinals to ordinals or
unicode strings.

something like

lut = dict((0xff00 + ch, 0x0020 + ch) for ch in range(0x80))

new_string = old_string.translate(lut)

could work (untested).

</F>
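
A bounded variant of the dictionary approach above, as a sketch in Python 3 syntax (an assumption; the thread uses Python 2, where the same dict works with unicode strings). The table as posted with range(0x80) would also remap U+FF5F to U+FF7F, which includes halfwidth katakana starting at U+FF61, so this version restricts the keys to U+FF01 through U+FF5E. The names `OFFSET` and `fullwidth_to_ascii` are hypothetical.

```python
# Distance between a fullwidth form and its ASCII counterpart.
OFFSET = 0xFF01 - 0x21  # 0xFEE0

# Map only the fullwidth forms U+FF01..U+FF5E to ASCII 0x21..0x7E.
lut = {cp: cp - OFFSET for cp in range(0xFF01, 0xFF5F)}
lut[0x3000] = 0x20  # ideographic space -> ASCII space


def fullwidth_to_ascii(text):
    """Translate fullwidth ASCII variants; leave other characters alone."""
    return text.translate(lut)


print(fullwidth_to_ascii("\uff21\uff22\uff23\uff11\uff12\uff13"))  # ABC123

# Halfwidth katakana KA (U+FF76) is outside the table and stays as-is.
print(fullwidth_to_ascii("\uff76") == "\uff76")  # True
```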

#4
August 13th, 2008, 11:15 AM
kettle
Re: python tr equivalent (non-ascii)

Excellent. I didn't realize from the docs that I could do that. Thanks.
 
