inserting Unicode character in dictionary - Python

gita ziabari

Hello All,

The following code does not work for unicode characters:

keyword = dict()
kw = 'ÇÅÎÓËÉÈ'
keyword.setdefault(key, []).append (kw)

It works fine for inserting ASCII character. Any suggestion?

Thanks,

Gita

Oct 17 '08 #1

Subscribe Post Reply

3866

Marc 'BlackJack' Rintsch

On Fri, 17 Oct 2008 13:07:38 -0400, gita ziabari wrote:

The following code does not work for unicode characters:

keyword = dict()
kw = 'Ð³ÐµÐ½ÑÐºÐ¸Ñ…'
keyword.setdefault(key, []).append (kw)

It works fine for inserting ASCII character. Any suggestion?

What do you mean by "does not work"? And you are aware that the above
snipped doesn't involve any unicode characters!? You have a byte string
there -- type `str` not `unicode`.

Ciao,
Marc 'BlackJack' Rintsch

Oct 17 '08 #2

Joe Strout

On Oct 17, 2008, at 11:24 AM, Marc 'BlackJack' Rintsch wrote:

>kw = 'Ð³ÐµÐ½ÑÐºÐ¸Ñ…'

What do you mean by "does not work"? And you are aware that the above
snipped doesn't involve any unicode characters!? You have a byte
string
there -- type `str` not `unicode`.

Just checking my understanding here -- are the following all true:

1. If you had prefixed that literal with a "u", then you'd have Unicode.

2. Exactly what Unicode you get would be dependent on Python properly
interpreting the bytes in the source file -- which you can make it do
by adding something like "-*- coding: utf-8 -*-" in a comment at the
top of the file.

3. Without the "u" prefix, you'll have some 8-bit string, whose
interpretation is... er... here's where I get a bit fuzzy. What if
your source file is set to utf-8? Do you then have a proper UTF-8
string, but the problem is that none of the standard Python library
methods know how to properly interpret UTF-8?

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Thanks for any answers/corrections,
- Joe

Oct 17 '08 #3

Marc 'BlackJack' Rintsch

On Fri, 17 Oct 2008 11:32:36 -0600, Joe Strout wrote:

On Oct 17, 2008, at 11:24 AM, Marc 'BlackJack' Rintsch wrote:

>>kw = 'Ð³ÐµÐ½ÑÐºÐ¸Ñ…'

What do you mean by "does not work"? And you are aware that the above
snipped doesn't involve any unicode characters!? You have a byte
string there -- type `str` not `unicode`.

Just checking my understanding here -- are the following all true:

1. If you had prefixed that literal with a "u", then you'd have Unicode.

Yes.

2. Exactly what Unicode you get would be dependent on Python properly
interpreting the bytes in the source file -- which you can make it do by
adding something like "-*- coding: utf-8 -*-" in a comment at the top of
the file.

Yes, assuming the encoding on that comment matches the actual encoding of
the file.

3. Without the "u" prefix, you'll have some 8-bit string, whose
interpretation is... er... here's where I get a bit fuzzy.

No interpretation at all, just the bunch of bytes that happen to be in
the source file.

What if your source file is set to utf-8? Do you then have a proper
UTF-8 string, but the problem is that none of the standard Python
library methods know how to properly interpret UTF-8?

Well, the decode method knows how to decode that bytes into a `unicode`
object if you call it with 'utf-8' as argument.

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Yes and no. The problem just shifts because at some point you get into
similar troubles, just in the other direction. Data enters the program
as bytes and must leave it as bytes again, so you have to deal with
encodings at those points.

Ciao,
Marc 'BlackJack' Rintsch

Oct 17 '08 #4

Joe Strout

Thanks for the answers. That clears things up quite a bit.

>What if your source file is set to utf-8? Do you then have a proper
UTF-8 string, but the problem is that none of the standard Python
library methods know how to properly interpret UTF-8?

Well, the decode method knows how to decode that bytes into a
`unicode`
object if you call it with 'utf-8' as argument.

OK, good to know.

>4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Yes and no. The problem just shifts because at some point you get
into
similar troubles, just in the other direction. Data enters the
program
as bytes and must leave it as bytes again, so you have to deal with
encodings at those points.

Yes, but that's still much better than having to litter your code with
'u' prefixes and .decode calls and so on. If I'm using a UTF-8-savvy
text editor (as we all should be doing in the 21st century!), and type
"foo = '2Ï€'", I should get a string containing a '2' and a pi
character, and all the text operations (like counting characters,
etc.) should Just Work.

When I read and write files or sockets or whatever, of course I'll
have to think about what encoding the text should be... but internal
to my own source code, I shouldn't have to.

I understand the need for a transition strategy, which is what we have
in 2.x, and that's working well enough. But I'll be glad when it's
over. :)

Cheers,
- Joe

Oct 17 '08 #5

=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=

2. Exactly what Unicode you get would be dependent on Python properly

interpreting the bytes in the source file -- which you can make it do by
adding something like "-*- coding: utf-8 -*-" in a comment at the top of
the file.

That depends on the Python version. Up to (and including) 2.4, the bytes
on the disk where interpreted as Latin-1 in absence of an encoding
declaration. In 2.5, not having an encoding declaration is an error. In
3.x, in absence of an encoding declaration, the bytes are interpreted as
UTF-8 (giving an error when ill-formed UTF-8 sequences are encountered).

3. Without the "u" prefix, you'll have some 8-bit string, whose
interpretation is... er... here's where I get a bit fuzzy. What if your
source file is set to utf-8?

You need to distinguish between the declared encoding, and the intended
(editor) encoding also. Some editors (like Emacs or IDLE) interpret the
declaration, others may not. What you see on the display is the editor's
interpretation; what Python uses is the declared encoding.

However, Python uses the declared encoding just for Unicode strings.

Do you then have a proper UTF-8 string,
but the problem is that none of the standard Python library methods know
how to properly interpret UTF-8?

There is (probably) no such thing as a "proper UTF-8 string" (in the
sense in which you probably mean it). Python doesn't have a data type
for "UTF-8 string". It only has a data type "byte string". It's up to
the application whether it gets interpreted in a consistent manner.
Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
encoded strings the same way as for, say, Big-5 encoded strings.

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

You still need to make sure that the editor's encoding and the declared
encoding match.

Regards,
Martin

Oct 18 '08 #6

gitaziabari

On Oct 17, 2:38 pm, Joe Strout <j...@strout.netwrote:

Thanks for the answers. That clears things up quite a bit.

What if your source file is set to utf-8? Do you then have a proper
UTF-8 string, but the problem is that none of the standard Python
library methods know how to properly interpret UTF-8?

Well, the decode method knows how to decode that bytes into a
`unicode`
object if you call it with 'utf-8' as argument.

OK, good to know.

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Yes and no. The problem just shifts because at some point you get
into
similar troubles, just in the other direction. Data enters the
program
as bytes and must leave it as bytes again, so you have to deal with
encodings at those points.

Yes, but that's still much better than having to litter your code with
'u' prefixes and .decode calls and so on. If I'm using a UTF-8-savvy
text editor (as we all should be doing in the 21st century!), and type
"foo = '2ð'", I should get a string containing a '2' and a pi
character, and all the text operations (like counting characters,
etc.) should Just Work.

When I read and write files or sockets or whatever, of course I'll
have to think about what encoding the text should be... but internal
to my own source code, I shouldn't have to.

I understand the need for a transition strategy, which is what we have
in 2.x, and that's working well enough. But I'll be glad when it's
over. :)

Cheers,
- Joe

Thanks for the answers. The following factors should be considerd when
dealing with unicode characters in python:
1. Declaring # -*- coding: utf-8 -*- at the top of script
2. Opening files with appropriate encoding:
txt = codecs.open (filename, 'w+', encoding='utf-8')

My program works fine now. There is no specific way of adding unicode
characters in list or dictionaies. The character itself has to be in
unicode.

Cheers,

Gita

Oct 19 '08 #7

by: sebastien.hugues | last post by:

Hi I would like to retrieve the application data directory path of the logged user on windows XP. To achieve this goal i use the environment variable APPDATA. The logged user has this name:...

Python

convert Unicode to lower/uppercase?

by: Hallvard B Furuseth | last post by:

Has someone got a Python routine or module which converts Unicode strings to lowercase (or uppercase)? What I actually need to do is to compare a number of strings in a case-insensitive manner,...

Python

Question concerning Unicode and or Shift-JIS

by: Antioch | last post by:

Ok, so Im a newb python programmer and I'm trying to create a simple python web-application. The program is simply going to read in pairs of words, parse them into a dictionary file, then randomly...

Python

Unicode charmap decoders slow

by: Tony Nelson | last post by:

Is there a faster way to decode from charmaps to utf-8 than unicode()? I'm writing a small card-file program. As a test, I use a 53 MB MBox file, in mac-roman encoding. My program reads and...

Python

A Unicode problem -HELP

by: manstey | last post by:

I am writing a program to translate a list of ascii letters into a different language that requires unicode encoding. This is what I have done so far: 1. I have ï»¿# -*- coding: UTF-8 -*- as my...

Python

Need to extract XML or SGML entities from a Unicode text

by: Frantic | last post by:

I'm working on a list of japaneese entities that contain the entity, the unicode hexadecimal code and the xml/sgml entity used for that entity. A unicode document is read into the program, then the...

.NET Framework

unicode "table of character" implementation in python

by: Nicolas Pontoizeau | last post by:

Hi, I am handling a mixed languages text file encoded in UTF-8. Theres is mainly French, English and Asian languages. I need to detect every asian characters in order to enclose it by a special...

Python

unicode

by: 7stud | last post by:

Based on this example and the error: ----- u_str = u"abc\u9999" print u_str UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in position 3: ordinal not in range(128) ------

Python

Re: convert unicode characters to visibly similar ascii characters

by: Terry Reedy | last post by:

Peter Bulychev wrote: I believe you will have to make up your own translation dictionary for the translations *you* want. You should then be able to use that with the .translate() method. tjr

Python

Wordpress or something else?

by: Faith0G | last post by:

I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

Content Management Systems

Access Europe: Command bars, the Access Shortcut Tool and a simple Audit Log - Wed 3 April

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

General

One-click Importing Excel Data into a*Database

by: ryjfgjl | last post by:

In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...

Microsoft Excel

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

inserting Unicode character in dictionary - Python

Similar topics