treating str as unicode in legacy code?

Ben
I'm left with some legacy code using plain old str, and I need to make
sure it works with unicode input/output. I have a simple plan to do
this:

- Run the code with "python -U" so all the string literals become
unicode literals.
- Add this statement

str = unicode

to all .py files so the type comparison (e.g., type('123') == str)
would work.
Did I miss anything? Does this sound like a workable plan?

Thanks!

Apr 12 '07 #1
Ben wrote:
> I'm left with some legacy code using plain old str, and I need to make
> sure it works with unicode input/output. I have a simple plan to do
> this:
>
> - Run the code with "python -U" so all the string literals become
> unicode literals.
> - Add this statement
>
> str = unicode
>
> to all .py files so the type comparison (e.g., type('123') == str)
> would work.
> Did I miss anything? Does this sound like a workable plan?
>
> Thanks!

Well, don't forget that the assignment to str *shadows* the built-in
rather than replacing it, so there may be places (imported modules being
the example that most readily springs to mind) where that replacement
won't be effective.

Also, in CPython the C parts of the code may well be creating and
expecting objects of type str, and they don't use the Python naming
mechanism at all, so you will have no way to change those
behaviors.

These two issues will probably account for about 95% of any strangeness
you see, but your plan is probably a good first step in the conversion
process.
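
A minimal sketch of the shadowing point, written for Python 3 so it can be run today. FakeUnicode is a hypothetical stand-in for Python 2's unicode type, and the two dictionaries stand in for two separate .py files; none of these names come from the original code.

```python
import builtins


class FakeUnicode(str):
    """Hypothetical stand-in for Python 2's `unicode` type."""


# Namespace of a "patched" module: it rebinds the name `str` for itself only.
patched = {"FakeUnicode": FakeUnicode}
exec(
    "str = FakeUnicode\n"
    "def str_name():\n"
    "    return str\n"
    "def literal_matches():\n"
    "    return type('123') == str\n"
    "def repr_matches():\n"
    "    return type(repr(123)) == str\n",
    patched,
)

# Namespace of an "imported" module that never received the patch.
untouched = {}
exec("def str_name():\n    return str\n", untouched)

# Inside the patched module the name `str` now means the replacement type...
print(patched["str_name"]() is FakeUnicode)
# ...but any other module still finds the real built-in.
print(untouched["str_name"]() is builtins.str)
# Literals are still built-in str objects unless the interpreter changes them,
# which is what the -U half of the plan is for.
print(patched["literal_matches"]())
# Objects produced by C-level code (here, repr) ignore the rebinding entirely.
print(patched["repr_matches"]())
```

The first two checks come out true and the last two false: the rebinding is visible only through name lookup in the module that performed it.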

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 13 '07 #2
On Apr 13, 5:57 am, "Ben" <benjamin....@gmail.com> wrote:
> I'm left with some legacy code using plain old str, and I need to make
> sure it works with unicode input/output. I have a simple plan to do
> this:
>
> - Run the code with "python -U" so all the string literals become
> unicode literals.

Requiring that the code is always run with a non-default argument
doesn't seem very robust/portable to me.

> - Add this statement
>
> str = unicode
>
> to all .py files so the type comparison (e.g., type('123') == str)
> would work.

IMVHO (1) doing that merely changes "legacy code" to "kludged legacy
code" (2) there is no substitute for reading the code and trying to
nut out what it is doing.

Do you mean that those two things are the ONLY changes you plan to
make?

> Did I miss anything? Does this sound like a workable plan?

Do you need to make sure it still works with ASCII input? With input
in some other encoding e.g. cp1252?

What do you mean by "unicode input"? Bear in mind that if you want to
work with Python unicode objects internally, input from a file /
socket / whatever will need to be decoded i.e. you will have to read
the code and make appropriate changes. Data stored in (say) utf_16_le
encoding is not "unicode" in the sense that you need; it still has to
be decoded.
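
To make the decoding point concrete, here is a small sketch in Python 3 terms, where bytes plays the role of the legacy str and str the role of unicode. The sample text and the helper name decode_input are made up for illustration.

```python
def decode_input(raw, encoding):
    """Turn bytes read from a file or socket into a text (unicode) object."""
    return raw.decode(encoding)


text = "na\u00efve caf\u00e9"  # sample text containing two non-ASCII letters

# The same text as it might arrive in two different encodings.
as_cp1252 = text.encode("cp1252")
as_utf16 = text.encode("utf_16_le")

# The bytes on the wire differ, but both decode to the same text object
# once the right codec is named.
print(as_cp1252 != as_utf16)
print(decode_input(as_cp1252, "cp1252") == text)
print(decode_input(as_utf16, "utf_16_le") == text)

# Pure ASCII input keeps working with the ASCII codec.
print(decode_input(b"abc", "ascii") == "abc")

# Naming the wrong codec fails as soon as a non-ASCII byte shows up.
try:
    decode_input(as_cp1252, "ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

The encoding has to come from somewhere outside the data (a protocol header, a config setting, a convention), which is why this step cannot be automated by rebinding a name.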

What do you mean by "unicode output"? You are going to need to encode
your output.

This doesn't work; the output is not "unicode" in any meaningful
sense:

>>> f = open(u'uout', u'w')
### Warning: you need to hope that all builtins etc that you are
### calling cope with unicode arguments as well as the above one does.
>>> f.write(u'abcde\n')
>>> f.close()
>>> open(u'uout', u'rb').read()
'abcde\r\n'

This doesn't work; it crashes.

>>> f = open('uout2', u'w')
>>> f.write(u'abcde\xff\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
>>>
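
One way around that crash is to name the output encoding explicitly instead of relying on the implicit ASCII default. The sketch below is Python 3, where open() takes an encoding argument (io.open offers the same parameter on Python 2.6 and later); the temporary directory and the choice of UTF-8 are assumptions for illustration.

```python
import os
import tempfile

text = u"abcde\xff\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "uout2")

    # newline="" turns off newline translation, so the bytes on disk
    # are the same on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Read the file back in binary mode to see what was actually stored.
    with open(path, "rb") as f:
        raw = f.read()

print(repr(raw))
print(raw.decode("utf-8") == text)

# Encoding the same text as ASCII reproduces the failure shown above.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

The character u'\xff' ends up as two bytes on disk, which is the sense in which output always has to be encoded into something.
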
Some object methods work differently with unicode; e.g.

(1) str.translate and unicode.translate take different kinds of
table.

(2)
>>> 'abc\xA0def'.split()
['abc\xa0def']
>>> u'abc\xA0def'.split()
[u'abc', u'def']
>>> '\xA0'.isspace()
False
>>> u'\xA0'.isspace()
True
>>>
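
Point (1) is not shown in the session above, so here is a rough sketch of the difference in Python 3 terms, where bytes.translate behaves like the legacy str.translate and str.translate like unicode.translate. The sample strings and table contents are invented for illustration.

```python
# Byte-string translate: a 256-byte lookup table, plus an optional
# second argument listing bytes to delete.
byte_table = bytes.maketrans(b"abc", b"xyz")
byte_result = b"a-b-c".translate(byte_table, b"-")

# Text translate: a mapping from code point to replacement,
# where a value of None deletes the character.
text_table = {ord("a"): "x", ord("-"): None, 0xA0: " "}
text_result = "a-b\xa0c".translate(text_table)

print(len(byte_table))
print(byte_result)
print(text_result)
```

Code that builds a 256-character table and passes a delete argument therefore needs rewriting, not just retyping, when its strings become unicode.
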
HTH,
John

Apr 14 '07 #3
