
Non-unicode strings & Python.

Jonathon Blake
#1: Jul 18 '05
All:

Question

Python is currently Unicode compliant.

What happens when strings are read in from text files that were
created using GB 2312-1980, or KPS 9566-2003, or other, equally
obscure code ranges?

The idea is to read text in the file's original encoding, replace each
character with the appropriate Unicode character, then write it out as a
new text file.
[Trivial to program, but incredibly time consuming to actually code]

xan

jonathon
--
Goto http://graphology.meetup.com for information about International
Graphology Meetup Day

Martin v. Löwis
#2: Jul 18 '05

re: Non-unicode strings & Python.


Jonathon Blake wrote:
> What happens when strings are read in from text files that were
> created using GB 2312-1980, or KPS 9566-2003, or other, equally
> obscure code ranges?

Python has two kinds of strings: byte strings, and Unicode strings.
If you read data from a file, you get byte strings - i.e. a sequence
of bytes representing literally the encoded contents of the file.
If you want Unicode strings, you need to use codecs.open.
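
To make the difference concrete, here is a minimal sketch that writes a small GB 2312 file and reads it back both ways. The sample text and the temporary file are assumptions for illustration only, and the code is written so that it also runs on Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample text; any characters covered by GB 2312 would do.
text = u"\u4e2d\u6587"

# Write the sample to a temporary file as GB 2312 bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("gb2312"))

# Plain binary read: the raw encoded bytes, exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# codecs.open decodes while reading and returns a Unicode string.
with codecs.open(path, "r", encoding="gb2312") as f:
    decoded = f.read()

os.remove(path)
```

Here raw holds the encoded bytes, while decoded holds the characters themselves; decoding raw by hand with raw.decode("gb2312") should give the same string.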

> The idea is to read text in the file format, and replace it with the
> appropriate Unicode character,then write it out as a new text file.
> [Trivial to program, but incredibly time consuming to actually code]

Not at all:

import codecs
# .read() is needed: codecs.open alone returns a file object, not the text
data = codecs.open(filename, "r", encoding="gb2312").read()
codecs.open(newfile, "w", encoding="utf-8").write(data)

assuming that by "appropriate Unicode character" you actually mean
"I want to write the file encoded as UTF-8".
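
For files too large to hold in memory at once, the same conversion could be done line by line. This is only a sketch: the transcode helper, its parameters, and the file names are made up for illustration, and the source encoding has to be one that Python actually ships a codec for.

```python
import codecs
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Copy src_path to dst_path, re-encoding the text one line at a time."""
    with codecs.open(src_path, "r", encoding=src_encoding) as src:
        with codecs.open(dst_path, "w", encoding=dst_encoding) as dst:
            for line in src:
                # Each line is a Unicode string; dst encodes it on write.
                dst.write(line)


# Usage with throwaway files; the names below are hypothetical.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")

sample = u"\u4e2d\u6587\n"
with open(src_path, "wb") as f:
    f.write(sample.encode("gb2312"))

transcode(src_path, dst_path, "gb2312")

with open(dst_path, "rb") as f:
    result = f.read()
```

If the input contains bytes that are not valid in the source encoding, decoding raises UnicodeDecodeError; codecs.open accepts an errors argument (for example errors="replace") if a lossy conversion is acceptable.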

Regards,
Martin