
August 13th, 2008, 09:25 AM
python tr equivalent (non-ascii)
Hi,
I was wondering how I ought to be handling character range
translations in python.
What I want to do is translate fullwidth numbers and roman alphabet
characters into their halfwidth ascii equivalents.
In perl I can do this pretty easily with tr:
tr/\x{ff00}-\x{ff5e}/\x{0020}-\x{007e}/;
and I think the string.translate method is what I need to use to
achieve the equivalent in python. Unfortunately the maketrans method
doesn't seem to accept character ranges and I'm also having trouble
with its interpretation of length. What I came up with was to first
fudge the ranges:
import string
my_test_string = u"ＡＢＣＤＥＦＧ"
f_range = "".join([unichr(x) for x in range(ord(u"\uff00"), ord(u"\uff5e"))])
t_range = "".join([unichr(x) for x in range(ord(u"\u0020"), ord(u"\u007e"))])
then use these as input to maketrans:
my_trans_string = my_test_string.translate(string.maketrans(f_range, t_range))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-93: ordinal not in range(128)
but it generates an encoding error... and if I encode the ranges in
utf8 before passing them on I get a length error because maketrans is
counting bytes not characters and utf8 is variable width...
my_trans_string = my_test_string.translate(string.maketrans(f_range.encode("utf8"), t_range.encode("utf8")))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: maketrans arguments must have same length
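The length mismatch described above is easy to reproduce. What follows is only a minimal sketch, and it assumes Python 3 (where str is already Unicode and unichr is spelled chr) rather than the Python 2 session shown in the post. The variable names f_range and t_range follow the post; char_lengths and byte_lengths are made up for illustration.

```python
# Assumes Python 3: chr() replaces unichr(), and str is Unicode text.
f_range = "".join(chr(cp) for cp in range(0xFF00, 0xFF5E))
t_range = "".join(chr(cp) for cp in range(0x0020, 0x007E))

# As text, both ranges contain the same number of characters.
char_lengths = (len(f_range), len(t_range))

# As UTF-8 bytes they differ: every code point in the fullwidth range
# takes three bytes, every ASCII code point takes one.
byte_lengths = (len(f_range.encode("utf-8")), len(t_range.encode("utf-8")))

print(char_lengths)  # (94, 94)
print(byte_lengths)  # (282, 94)
```

So a byte-oriented table builder sees arguments of 282 and 94 bytes, which is consistent with the "must have same length" error.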
August 13th, 2008, 09:35 AM
Re: python tr equivalent (non-ascii)
On Aug 13, 5:18 pm, kettle <Josef.Robert.No...@gmail.com> wrote:
> Hi,
> I was wondering how I ought to be handling character range
> translations in python.
Ok so I guess I was barking up the wrong tree. Searching for python 全角　半角
quickly brought up a solution:
>>> import unicodedata
>>> my_test_string = u"フガホゲ-%*@AＢＣ－％＊＠１２3"
>>> print unicodedata.normalize('NFKC', my_test_string)
フガホゲ-%*@ABC-%*@123
Still, it would be nice if there was a more general solution, or if
maketrans actually looked at chars instead of bytes, methinks.
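For anyone reading this on a newer interpreter, here is a rough sketch of the same normalization approach. It assumes Python 3, where string literals are already Unicode and print is a function. The helper name fold_width is made up for this example and is not part of unicodedata.

```python
import unicodedata


def fold_width(text):
    """Return text with compatibility forms (fullwidth letters, digits,
    punctuation and so on) replaced by their NFKC equivalents."""
    return unicodedata.normalize("NFKC", text)


sample = "\uff21\uff22\uff23\uff11\uff12\uff13"  # fullwidth "ABC123"
folded = fold_width(sample)
print(folded)  # ABC123

# NFKC is broader than a width fix: halfwidth katakana get widened too.
widened = fold_width("\uff76")  # HALFWIDTH KATAKANA LETTER KA
print(widened == "\u30ab")  # True
```

Because NFKC touches every compatibility character, not just the fullwidth ASCII range, it may change more than the Perl tr expression would. Whether that is acceptable depends on the data.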
August 13th, 2008, 09:35 AM
Re: python tr equivalent (non-ascii)
kettle wrote:
> I was wondering how I ought to be handling character range
> translations in python.
maketrans only works for byte strings.
as for translate itself, it has different signatures for byte strings
and unicode strings; in the former case, it takes a lookup table
represented as a 256-byte string (e.g. created by maketrans), in the
latter case, it takes a dictionary mapping from ordinals to ordinals or
unicode strings.
something like
lut = dict((0xff00 + ch, 0x0020 + ch) for ch in range(0x5f))  # U+FF00..U+FF5E, as in the tr range
new_string = old_string.translate(lut)
could work (untested).
</F>
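The dictionary-based translate suggested above can be sketched as follows. This is an illustration under assumptions, not a tested drop-in: it assumes Python 3, where str.translate takes a mapping keyed by code point. It also starts the range at U+FF01, since U+FF00 is unassigned, and maps the ideographic space U+3000 separately. The constant names are invented for this example.

```python
# Assumes Python 3, where str.translate takes a mapping keyed by code point.
FULLWIDTH_START = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_END = 0xFF5E    # FULLWIDTH TILDE
OFFSET = 0xFF01 - 0x0021  # distance between the fullwidth and ASCII ranges

# Map each fullwidth code point to its ASCII counterpart.
lut = {cp: cp - OFFSET for cp in range(FULLWIDTH_START, FULLWIDTH_END + 1)}
lut[0x3000] = 0x0020  # IDEOGRAPHIC SPACE -> SPACE (assumed desirable here)

old_string = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13"  # fullwidth "ABC 123"
new_string = old_string.translate(lut)
print(new_string)  # ABC 123

# Characters that are not in the table pass through unchanged.
passthrough = "\u30ab\uff21".translate(lut)
print(passthrough)  # カA
```

Unlike NFKC normalization, a table like this only touches the code points listed in it, so katakana and other text are left alone. In Python 3, str.maketrans also accepts two equal-length Unicode strings, which is closer to what the original post was attempting.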
August 13th, 2008, 11:15 AM
Re: python tr equivalent (non-ascii)
On Aug 13, 5:33 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
> something like
>
> lut = dict((0xff00 + ch, 0x0020 + ch) for ch in range(0x5f))
>
> new_string = old_string.translate(lut)
>
> could work (untested).
>
> </F>
excellent. i didn't realize from the docs that i could do that. thanks