Python newbie needs Unicode help

HELP!
Guy who was here before me wrote a script to parse files in Python.

Includes line:
print u
where u is a line from a file we are parsing.
However, we have started receiving data from Brazil. If I open the file to
parse in vi, it looks like:

<Utt id="3" transcribe="yes" audioRoot="A1"
audio="313-20070102144528.wav" grammarSet="G3" rawText="n&#227;o"
recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0"
transcribedText="n&#227;o" parsableText="n&#227;o"/>

Clearly those "n&#227" are some non-Ascii characters, but how do I get
print to understand that?

I keep getting:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
position 40:
ordinal not in range(128)"

Jan 11 '07 #1
4 Replies


Does the error happen at the

print u

line? If yes, what happens is that you try to print a unicode object.
That means it has to be converted (actually the right term is
encoded) to a byte string. If you don't do that explicitly, it will be
done implicitly, using the default encoding, which is ASCII.

If you have non-ascii characters, you end up with the error you see.

What to do? Use something like this:

print u.encode('utf-8')

instead.
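
A minimal sketch of the difference between the implicit ASCII encoding and an explicit one. The variable names are only illustrative, and the snippet avoids the print statement so that it runs on both Python 2 and Python 3:

```python
# u"n\xe3o" is "não"; \xe3 is the character named in the traceback.
text = u"n\xe3o"

# Encoding to ASCII fails, which is what the implicit conversion runs into.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode("utf-8")
```

Whether UTF-8 is the right choice depends on what the terminal or the receiving program expects; it is only one possible target encoding.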

Diez
Jan 11 '07 #2

Progress! You managed to change the error message.

File "./acc_test_script_generator.py", line 106, in loadData
print u.encode('utf-8')
AttributeError: Utterance instance has no attribute 'encode'

I'm missing something really obvious here, but I don't know what it
is...
Jan 11 '07 #3

At Thursday 11/1/2007 18:27, gh************@gmail.com wrote:
> HELP!
> Guy who was here before me wrote a script to parse files in Python.
>
> Includes line:
> print u
> where u is a line from a file we are parsing.
> However, we have started receiving data from Brazil. If I open the file to
> parse in vi, it looks like:
>
> <Utt id="3" transcribe="yes" audioRoot="A1"
> audio="313-20070102144528.wav" grammarSet="G3" rawText="n&#227;o"
> recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0"
> transcribedText="n&#227;o" parsableText="n&#227;o"/>

Is this part of an XML document? You should use a
true XML parser instead of doing that by hand.
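
As a rough illustration of that suggestion, here is a sketch using the standard library's xml.etree.ElementTree. It assumes the file spells the character as the numeric reference &#227; (as the question implies), and the variable names are made up for the example:

```python
import xml.etree.ElementTree as ET

# The element from the question, with the character written as &#227;
# (an assumption about how the file actually stores it).
sample = (
    '<Utt id="3" transcribe="yes" audioRoot="A1" '
    'audio="313-20070102144528.wav" grammarSet="G3" rawText="n&#227;o" '
    'recValue="{data:CHOICE=NO;}" conf="970" rawText2="" conf2="0" '
    'transcribedText="n&#227;o" parsableText="n&#227;o"/>'
)

utt = ET.fromstring(sample)

# The parser resolves the character reference, so the attribute value
# comes back as text containing the real character U+00E3.
raw_text = utt.get("rawText")
confidence = utt.get("conf")
```

The values returned by the parser are text, so they still need an explicit encode() before being written to a byte-oriented output.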

> Clearly those "n&#227" are some non-Ascii characters, but how do I get
> print to understand that?

Understanding how Unicode works may be very
useful: http://www.amk.ca/python/howto/unicode

> I keep getting:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in
> position 40:
> ordinal not in range(128)"

py> u = u"áéíóú"
py> print u, repr(u)
áéíóú u'\xe1\xe9\xed\xf3\xfa'
py> print str(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
py> print u.encode('cp850')
áéíóú

(cp850 is my console encoding)
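
Since the console encoding differs from machine to machine, one option is to look it up instead of hard-coding it. The sketch below assumes a hypothetical helper, encode_for_console, which is not part of any library; it falls back to UTF-8 when sys.stdout reports no encoding (for example when output is piped):

```python
import sys


def encode_for_console(text, fallback="utf-8"):
    """Encode text for the current console (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is redirected.
    encoding = getattr(sys.stdout, "encoding", None) or fallback
    # "replace" substitutes "?" for characters the encoding cannot represent
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")


accented = u"\xe1\xe9\xed\xf3\xfa"  # the same five accented vowels as above

# Encoding for one specific console code page, as in the session above.
cp850_bytes = accented.encode("cp850")

# Encoding for whatever console this process is attached to.
console_bytes = encode_for_console(accented)
```

Using "replace" trades correctness for robustness; "strict" (the default) may be preferable when silently losing characters is not acceptable.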
--
Gabriel Genellina
Softlab SRL



Jan 12 '07 #4

At Thursday 11/1/2007 20:42, gh************@gmail.com wrote:
> Progress! You managed to change the error message.
>
> File "./acc_test_script_generator.py", line 106, in loadData
> print u.encode('utf-8')
> AttributeError: Utterance instance has no attribute 'encode'
>
> I'm missing something really obvious here, but I don't know what it
> is...

Then you're not "printing a line from a file we are parsing", which
should be a string or unicode object. You're printing some
"Utterance" instance; probably it has a __str__ method, and there,
you're mixing unicode+strings.
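
The real Utterance class is not shown in this thread, so the following is only a guess at the shape of a fix: keep everything as unicode text inside the object and encode once, at the point of output. The class, its attributes, and the as_text method are all hypothetical stand-ins:

```python
class Utterance(object):
    """Hypothetical stand-in for the class in the script."""

    def __init__(self, utt_id, raw_text):
        self.utt_id = utt_id
        self.raw_text = raw_text  # expected to be unicode text

    def as_text(self):
        # Build the description from unicode pieces only, so no implicit
        # ASCII conversion happens while formatting.
        return u"Utt %s: %s" % (self.utt_id, self.raw_text)


u = Utterance(3, u"n\xe3o")

# Encode once, at the output boundary, on the text rather than on the object.
line = u.as_text().encode("utf-8")
```

In Python 2 the same idea is often expressed by defining __unicode__ to return the text and having __str__ return an encoded version of it.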
--
Gabriel Genellina
Softlab SRL



Jan 12 '07 #5
