A Unicode problem -HELP

manstey

I am writing a program to translate a list of ascii letters into a
different language that requires unicode encoding. This is what I have
done so far:

1. I have # -*- coding: UTF-8 -*- as my first line.
2. In Wing IDE I have set Default Encoding to UTF-8
3. I have imported codecs and opened and written my file, which doesn't
have a BOM, as encoding=UTF-8
4. I have written a dictionary for translation, with entries such as
{'F':u'\u0254'} and a function to do the translation

Everything works fine, except that my output file, when loaded in
unicode aware emeditor has
(u'F', u'\u0254')

But I want to display it as:
('F', 'ɔ') # where the ɔ is a back-to-front 'c'

So my questions are:
1. How do I do this?
2. Do I need to change any of my steps above?

May 12 '06 #1

Martin v. Löwis

manstey wrote:

1. I have # -*- coding: UTF-8 -*- as my first line.
2. In Wing IDE I have set Default Encoding to UTF-8
3. I have imported codecs and opened and written my file, which doesn't
have a BOM, as encoding=UTF-8
4. I have written a dictionary for translation, with entries such as
{'F':u'\u0254'} and a function to do the translation

Everything works fine, except that my output file, when loaded in
unicode aware emeditor has
(u'F', u'\u0254')

I couldn't quite follow this description: what is "your output file"
(in what step is it created?), and how does

(u'F', u'\u0254')

get into this file? What is the precise Python statement that
produces that line of output?

So my questions are:
1. How do I do this?

Most likely, you use (directly or indirectly) the repr() function
to convert a tuple into that string. You shouldn't do that;
instead, you should format the elements of the tuple yourself, e.g.
through

print >>f, u"('%s', '%s')" % value
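
For illustration, here is a minimal sketch of the difference, written in Python 3 syntax (an assumption; the thread itself uses Python 2). In Python 3, repr() no longer escapes non-ASCII characters, so ascii() stands in here for the Python 2 behaviour. The variable names are made up for this example.

```python
# Python 3 sketch: ascii() mimics what Python 2's str()/repr() did to a tuple.
value = ("F", "\u0254")

# Rendering the whole tuple yields escape sequences, not the character itself.
escaped = ascii(value)

# Formatting each element directly keeps the real character in the string.
formatted = "('%s', '%s')" % value

print(escaped)
print(len(formatted))
```

The first string contains a literal backslash followed by u0254; the second contains the single character U+0254, which is what should be written to the file.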

Regards,
Martin

May 12 '06 #2

manstey

Hi Martin,

Here is how I write:

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info + parse + gloss))  # = three functions that return tuples

(u'F', u'\u0254') are two of the many unicode tuple elements returned
by the three functions.

What am I doing wrong?

May 17 '06 #3

Ben Finney

"manstey" <ma*****@csu.edu.au> writes:

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info + parse + gloss))  # = three functions that return tuples

If you mean that 'word_info', 'parse' and 'gloss' are three functions
that return tuples, then you get that return value by calling them.

>>> def foo():
...     return "foo's return value"
...
>>> def bar(baz):
...     return "bar's return value (including '%s')" % baz
...
>>> print foo()
foo's return value
>>> print bar
<function bar at 0x401fe80c>
>>> print bar("orange")
bar's return value (including 'orange')

--
\ "A man must consider what a rich realm he abdicates when he |
`\ becomes a conformist." -- Ralph Waldo Emerson |
_o__) |
Ben Finney

May 17 '06 #4

manstey

I'm a newbie at python, so I don't really understand how your answer
solves my unicode problem.

I have done more reading on unicode and then tried my code in IDLE
rather than WING IDE, and discovered that it works fine in IDLE, so I
think WING has a problem with unicode. For example, in WING this code
returns an error:

a={'a':u'\u0254'}
print a['a']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0254' in
position 0: ordinal not in range(128)

but in IDLE it correctly prints open o

So, assuming I now work in IDLE, all I want help with is how to read in
an ascii string and convert its letters to various unicode values and
save the resulting 'string' to a utf-8 text file. Is this clear?

so in pseudo code
1. F is converted to \u0254, $ is converted to \u0283, C is converted
to \u02A6\u02C1, etc.
(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
2. I read in a file with lines like:
F$
FCF$
$$C$ etc
3. I convert this to
\u0254\u0283
\u0254\u02A6\u02C1\u0254 etc
4. i save the results in a new file

when i read the new file in a unicode editor (EmEditor), i don't see
\u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
ts digraph, modifier letter reversed glottal stop, etc.)

I'm sure this is straightforward but I can't get it to work.

All help appreciated!

May 17 '06 #5

Ben Finney

"manstey" <ma*****@csu.edu.au> writes:

I'm a newbie at python, so I don't really understand how your answer
solves my unicode problem.

Since your replies fail to give any context of the existing
discussion, I could only go by the content of what you'd written in
that message. I didn't see a problem with anything Unicode -- I saw
three objects being added together, which you told us were function
objects. That's the problem I pointed out.

--
\ "When a well-packaged web of lies has been sold to the masses |
`\ over generations, the truth will seem utterly preposterous and |
_o__) its speaker a raving lunatic." -- Dresden James |
Ben Finney

May 17 '06 #6

Martin v. Löwis

manstey wrote:

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info + parse + gloss))  # = three functions that return tuples

(u'F', u'\u0254') are two of the many unicode tuple elements returned
by the three functions.

What am I doing wrong?

Well, the primary problem is that you don't tell us what you are really
doing. For example, it is very hard to believe that this is the actual
code that you are running:

If word_info, parse, and gloss are functions, the code should read

input_file = open(input_file_loc, 'r')
output_file = open(output_file_loc, 'w')
for line in input_file:
    output_file.write(str(word_info() + parse() + gloss()))

I.e. you need to call the functions for this code to make any sense.
You have probably chosen to edit the code in order to not show us
your real code. Unfortunately, since you are a newbie in Python,
you make errors in doing so, and omit important details. That makes
it very difficult to help you.

Regards,
Martin

May 17 '06 #7

manstey

OK, I apologise for not being clearer.

1. Here is my input data file, line 2:
gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

2. Here is my output data file, line 2:
u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
'', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

3. Here is my main program:
# -*- coding: UTF-8 -*-
import codecs

import splitFunctions
import surfaceIPA

# Constants for file location

# Working directory constants
dir_root = 'E:\\'
dir_relative = '2 Core\\2b Data\\Data Working\\'

# Input file constants
input_file_name = 'in.grab.txt'
input_file_loc = dir_root + dir_relative + input_file_name
# Initialise input file
input_file = codecs.open(input_file_loc, 'r', 'utf-8')

# Output file constants
output_file_name = 'out.grab.txt'
output_file_loc = dir_root + dir_relative + output_file_name
# Initialise output file
output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode

i = 0
for line in input_file:
    if line[0] != '>':  # Ignore headers
        i += 1
        if i != 1:
            word_info = splitFunctions.splitGrab(line, i)
            parse=splitFunctions.splitParse(word_info[10])
            gloss=surfaceIPA.surfaceIPA(word_info[6],word_info[8],word_info[9],parse)
            a=str(word_info + parse + gloss).encode('utf-8')
            a=a[1:len(a)-1]
            output_file.write(a)
            output_file.write('\n')

input_file.close()
output_file.close()

print 'done'

4. Here is my problem:
At the end of my output file, where my unicode character \u0254 (OPEN
O) appears, the file has '\xc9\x94'

What I want is an output file like:

'gn', '1', '1', '1', '2', '-', ..... 'ɔ'

where ɔ is an open O, and would display correctly in the appropriate
font.

Once I can get it to display properly, I will rewrite gloss so that it
returns a proper translation of 'R")$I73YT', which will be a string of
unicode characters.

Is this clearer? The other two functions are basic. splitGrab turns
'gn1:1,1.2 R")$I73YT R")$IYT@ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT
@ ncfsa' and splitParse turns the final piece of this 'ncfsa' into 'n c
f s a'. They have to be done separately as splitParse involves some
translation and program logic. SurfaceIPA reads in 'R")$I73YT' and
other data to produce the unicode string. At the moment it just returns
two dummy strings and u'\u0254'.encode('utf-8').

All help is appreciated!

Thanks

May 17 '06 #8

Martin v. Löwis

manstey wrote:

a=str(word_info + parse + gloss).encode('utf-8')
a=a[1:len(a)-1]

Is this clearer?

Indeed. The problem is your usage of str() to "render" the output.
As word_info+parse+gloss is a list (or is it a tuple?), str() will
already produce "Python source code", i.e. an ASCII byte string
that can be read back into the interpreter; all Unicode is gone
from that string. If you want comma-separated output, you should
do this:

def comma_separated_utf8(items):
    result = []
    for item in items:
        result.append(item.encode('utf-8'))
    return ", ".join(result)

and then
a = comma_separated_utf8(word_info + parse + gloss)

Then you don't have to drop the parentheses from a anymore, as
it won't have parentheses in the first place.

As the encoding will be done already in the output file,
the following should also work:

a = u", ".join(word_info + parse + gloss)

This would make "a" a comma-separated unicode string, so that
the subsequent output_file.write(a) encodes it as UTF-8.
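
As a rough sketch of that second approach in Python 3 syntax (an assumption; the thread uses Python 2), the example below joins placeholder tuples and lets a UTF-8 text wrapper do the encoding. The tuple contents are invented stand-ins, not the poster's real data, and the in-memory buffer replaces the real output file.

```python
import io

# Placeholder values standing in for the three function results.
word_info = ("gn", "1")
parse = ("nc", "f")
gloss = ("\u0254",)

# Joining keeps the elements as text; no repr() escaping is involved.
a = ", ".join(word_info + parse + gloss)

# The text wrapper encodes to UTF-8 on write, like a file opened with that encoding.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
out.write(a + "\n")
out.flush()
```

After the flush, raw holds the UTF-8 bytes, with the open o stored as the two bytes C9 94.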

If that doesn't work, I would like to know what the exact
value of gloss is, do

print "GLOSS IS", repr(gloss)

to print it out.

Regards,
Martin

May 17 '06 #9

Tim Roberts

"manstey" <ma*****@csu.edu.au> wrote:

I have done more reading on unicode and then tried my code in IDLE
rather than WING IDE, and discovered that it works fine in IDLE, so I
think WING has a problem with unicode.

Rather, its output defaults to ASCII.

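
A small sketch of what "defaults to ASCII" means, in Python 3 syntax (an assumption; the thread uses Python 2). Encoding U+0254 with the ascii codec fails with the same kind of error the Wing IDE console reported, while UTF-8 succeeds.

```python
a = {"a": "\u0254"}

try:
    # This is what an ASCII-only output stream has to do before printing.
    a["a"].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    reason = exc.reason  # explanation supplied by the codec

# The same character encodes without trouble as UTF-8.
utf8_bytes = a["a"].encode("utf-8")
print(ascii_ok, len(utf8_bytes))
```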
So, assuming I now work in IDLE, all I want help with is how to read in
an ascii string and convert its letters to various unicode values and
save the resulting 'string' to a utf-8 text file. Is this clear?

so in pseudo code
1. F is converted to \u0254, $ is converted to \u0283, C is converted
to \u02A6\u02C1, etc.
(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
2. I read in a file with lines like:
F$
FCF$
$$C$ etc
3. I convert this to
\u0254\u0283
\u0254\u02A6\u02C1\u0254 etc
4. i save the results in a new file

when i read the new file in a unicode editor (EmEditor), i don't see
\u0254\u02A6\u02C1\u0254, but I see the actual characters (open o, esh,
ts digraph, modifier letter reversed glottal stop, etc.)

Of course. Isn't that exactly what you wanted? The Python string
u"\u0254" contains one character (Latin small open o). It does NOT contain
6 characters. If you write that to a file, that file will contain 1
character -- 2 bytes.

If you actually want the 6-character string \u0254 written to a file, then
you need to escape the \u special code: "\\u0254". However, I don't see
what good that would do you. The \u escape is a Python source code thing.
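
To make the character and byte counts concrete, here is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2, where the literals would carry a u prefix).

```python
open_o = "\u0254"    # one character: Latin small letter open o
escaped = "\\u0254"  # six characters: backslash, u, 0, 2, 5, 4

encoded = open_o.encode("utf-8")  # the bytes that end up in a UTF-8 file

print(len(open_o), len(encoded), len(escaped))  # prints: 1 2 6
```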

I'm sure this is straightforward but I can't get it to work.

I think it is working exactly as you want.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

May 17 '06 #10

Ben Finney

"manstey" <ma*****@csu.edu.au> writes:

1. Here is my input data file, line 2:
gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

Your program is reading this using the 'utf-8' encoding. When it does
so, all the characters you show above will be read in happily as you
see them (so long as you view them with the 'utf-8' encoding), and
converted to Unicode characters representing the same thing.

Do you have any other information that might indicate this is *not*
utf-8 encoded data?

2. Here is my output data file, line 2:
u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
'', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

As you can see, reading the file with 'utf-8' encoding and writing it
out again as 'utf-8' encoding, the characters (as you posted them in
the message) have been faithfully preserved by Unicode processing and
encoding.

Bear in mind that when you present the "input data file, line 2" to
us, your message is itself encoded using a particular character
encoding. (In the case of the message where you wrote the above, it's
'utf-8'.) This means we may or may not be seeing the exact same bytes
you see in the input file; we're seeing characters in the encoding you
used to post the message.

You need to know what encoding was used when the data in that file was
written. You can then read the file using that encoding, and convert
the characters to unicode for processing inside your program. When you
write them out again, you can choose the 'utf-8' encoding as you have
done.
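
A minimal sketch of that read, process, write cycle, in Python 3 syntax (an assumption; the thread uses Python 2 with codecs.open). The source encoding, file names and sample text are invented for illustration; substitute whatever encoding the input file was really written in.

```python
import os
import tempfile

source_encoding = "latin-1"  # assumption for this example only

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.txt")
    out_path = os.path.join(tmp, "out.txt")

    # Create a sample input file in the "unknown until you check" encoding.
    with open(in_path, "wb") as f:
        f.write("caf\u00e9\n".encode(source_encoding))

    # Decode on read with the known encoding, encode on write as UTF-8.
    with open(in_path, "r", encoding=source_encoding) as src, \
         open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(line)

    with open(out_path, "rb") as f:
        result = f.read()
```

Inside the loop every line is ordinary text, so any translation step can work on characters without caring how they were stored on disk.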

Have you read this excellent article on understanding the programming
implications of character sets and Unicode?

"The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No
Excuses!)"
<URL:http://www.joelonsoftware.com/articles/Unicode.html>

--
\ "I'd like to see a nude opera, because when they hit those high |
`\ notes, I bet you can really see it in those genitals." -- Jack |
_o__) Handey |
Ben Finney

May 17 '06 #11

manstey

Hi Martin,

Thanks very much. Your def comma_separated_utf8(items): approach raises
an exception in codecs.py, so I tried a = u", ".join(word_info + parse +
gloss), which works perfectly. So I want to understand exactly why this
works. word_info and parse and gloss are all tuples. does str convert
the three into an ascii string? but the join method retains their
unicode status.

In the text file, the unicode characters appear perfectly, so I'm very
happy.

cheers
matthew

May 17 '06 #12

Martin v. Löwis

manstey wrote:

Thanks very much. Your def comma_separated_utf8(items): approach raises
an exception in codecs.py, so I tried a = u", ".join(word_info + parse +
gloss), which works perfectly. So I want to understand exactly why this
works. word_info and parse and gloss are all tuples. does str convert
the three into an ascii string?

Correct: a tuple is converted into a string with (contents), where
contents is achieved through comma-separating repr() of each tuple
element. repr(a_unicode_string) creates a \x or \u representation.

but the join method retains their unicode status.

Correct. The result is a Unicode string if the joiner is a Unicode
string, and all tuple elements are Unicode strings. If one is not,
a conversion to Unicode is attempted.
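
The tuple rule can be checked by hand. This is a sketch in Python 3 syntax (an assumption; the thread uses Python 2), where repr() leaves printable non-ASCII characters alone instead of escaping them and join() no longer converts element types, but the structure of the result is the same.

```python
items = ("F", "\u0254")

# str() of a tuple is "(" + comma-separated repr() of each element + ")".
by_hand = "(" + ", ".join(repr(item) for item in items) + ")"
same = by_hand == str(items)

# join() works on the strings themselves, so no quotes or parentheses appear.
joined = ", ".join(items)

print(same, len(joined))
```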

In the text file, the unicode characters appear perfectly, so I'm very
happy.

Glad it works.

Regards,
Martin

May 17 '06 #13
