By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
445,645 Members | 1,065 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 445,645 IT Pros & Developers. It's quick & easy.

Non-unicode strings & Python.

P: n/a
All:

Question

Python is currently Unicode Compliant.

What happens when strings are read in from text files that were
created using GB 2312-1980, or KPS 9566-2003, or other, equally
obscure code ranges?

The idea is to read text in the file format, and replace it with the
appropriate Unicode character,then write it out as a new text file.
[Trivial to program, but incredibly time consuming to actually code]

xan

jonathon
--
Goto http://graphology.meetup.com for information about International
Graphology Meetup Day
Jul 18 '05 #1
Share this Question
Share on Google+
1 Reply


P: n/a
Jonathon Blake wrote:
What happens when strings are read in from text files that were
created using GB 2312-1980, or KPS 9566-2003, or other, equally
obscure code ranges?
Python has two kinds of strings: byte strings, and Unicode strings.
If you read data from a file, you get byte strings - i.e. a sequence
of bytes representing literally the encoded contents of the file.
If you want Unicode strings, you need to use codecs.open.
The idea is to read text in the file format, and replace it with the
appropriate Unicode character,then write it out as a new text file.
[Trivial to program, but incredibly time consuming to actually code]


Not at all:

data = codecs.open(filename, "r", encoding="gb2312")
codecs.open(newfile, "w", encoding="utf-8").write(data)

assuming that by "appropriate Unicode character" you actually mean
"I want to write the file encoded as UTF-8".

Regards,
Martin
Jul 18 '05 #2

This discussion thread is closed

Replies have been disabled for this discussion.