unicode string literals and "u" prefix

In my Python scripts, I use a lot of accented characters, as I work in
French.
To do this, I put the line
# -*- coding: UTF-8 -*-
at the beginning of the script file.
Then, when I need to store accented characters in a string, I usually
prefix the string literal with 'u', like this:
mystring = u"prénom"

But if I understand correctly, prefixing a unicode string literal with 'u'
will eventually become obsolete (in Python 3.0?), as all strings
will be unicode in a more or less distant future.

So, to write "clean" script code, is it a good idea to write a script
like this?

---- myscript ----

#! /usr/local/bin/python -U
# -*- coding: UTF-8 -*-

s = 'hélène'
print len(s)
print s

-------------------

The second line says that all string literals are encoded in UTF-8, as
I work with an editor that saves all my files as UTF-8.

Normally, I should write
s = u'hélène'
but the -U Python option makes Python treat string literals as unicode
strings.
(I know the -U option may disappear in a future Python version, but isn't
it better to delete the "-U" option at the top of the scripts than
all the "u" unicode prefixes, when Python considers all strings as
unicode?)

Finally, I write
print s
instead of
print s.encode('utf-8')
as I used to, because I want this script to work on computers with other
encodings.
It seems that "print" encodes by default with the shell's current
encoding.
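
The behaviour described here can be made visible with a short sketch. It is written for Python 3 rather than the Python 2 of the script above, and it only approximates what print does internally: it looks up the encoding attached to standard output and applies it by hand. The fallback to UTF-8 and the errors="replace" policy are assumptions of this sketch, not something print itself guarantees.

```python
import sys

text = "hélène"

# Encoding attached to standard output. It can be None when stdout has been
# replaced by an in-memory stream, so this sketch falls back to UTF-8.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"

# The explicit equivalent of letting print pick the encoding. errors="replace"
# substitutes "?" on terminals that cannot represent the accented letters.
encoded = text.encode(stdout_encoding, errors="replace")

print(stdout_encoding)
print(len(text), len(encoded))
```

The two lengths differ whenever the terminal encoding needs more than one byte for an accented letter, which is the case for UTF-8.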

Is this the best way to deal with accented characters ?
Do you think that a script written like this will still work with
python 3.0 ?
Any comment ?
Jul 18 '05 #1
nico wrote:
> But if I understand correctly, prefixing a unicode string literal with 'u'
> will eventually become obsolete (in Python 3.0?), as all strings
> will be unicode in a more or less distant future.

I think you misunderstand. It might become deprecated (in the sense
of becoming redundant yet still possible); in any case, this future
is certainly distant (maybe five or ten years).

> So, to write "clean" script code, is it a good idea to write a script
> like this?

No. Do use Unicode literals whenever you can.

> (I know the -U option may disappear in a future Python version, but isn't
> it better to delete the "-U" option at the top of the scripts than
> all the "u" unicode prefixes, when Python considers all strings as
> unicode?)

As you say: *when*. Current Python doesn't, and explicit is better than
implicit. There is no plan yet as to when (or even if) to release Python
3.

> It seems that "print" encodes by default with the shell's current
> encoding.

Yes, it should.

> Do you think that a script written like this will still work with
> Python 3.0?

Most certainly. Even when string literals become Unicode by default,
u""-prefixes will still be accepted - most likely so for ten or
twenty years.
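
A quick way to check this on a later interpreter, as a sketch only: it assumes Python 3.3 or newer, where the u prefix is accepted (PEP 414 restored it after Python 3.0 to 3.2 had rejected it). On those versions the prefix is redundant, so both literals below are ordinary str objects.

```python
# Assumes Python 3.3+; the u"" prefix is accepted there and changes nothing.
prefixed = u"prénom"
plain = "prénom"

print(type(prefixed).__name__)
print(prefixed == plain)
```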

Regards,
Martin
Jul 18 '05 #2
Thank you a lot for your answer.

I understand better, now.
Nevertheless, this whole unicode issue is quite confusing for beginners
(I started learning Python two months ago).
And it seems that I am not the only one in this situation.
In fact, I just came across this discussion from April 2003, "[Zope3-dev]
i18n, unicode, and the underline":
http://mail.zope.org/pipermail/zope3...il/006410.html.

Since I work for an insurance company, most of our data contains French
accented characters.
So we are condemned to work essentially with unicode strings.
In fact, it is hard to find examples where plain ASCII strings would
be useful in our case.
Even the data we retrieve from databases is returned to us as unicode
strings.

That's why I tried to find a way to get rid of all those "u" prefixes
instead of systematically putting one in front of each unicode string
literal, which is somewhat "noisy".
It's also because I am afraid that at some point someone will forget the
"u" prefix, and the error will only be detected at a much later stage, or
too late.
A way of making all string literals unicode by default would have been a
relief.

It would be good if we could just write a declaration at the beginning
of the source file like
# strings_are_unicode_by_default
We would then write unicode strings without the "u" prefix, like this:
s="élément"
and if we really needed plain ASCII strings, we could explicitly
prefix them with "a", for instance s=a"my plain ascii string".
That way everybody would be happy, and there would be no impact on
already written code or libraries.
But there must be issues I am not aware of, I suppose...

I think you have the same problem when you write strings in German.
But if it is no problem for you to prefix your strings with "u", as
in:
s=u"Vielen Dank für Ihre Antwort"
then we can live with it too, for the next twenty years.

Sometimes I feel like an ethnic minority when I see in a
well-known book about Python that "Because Unicode is a relatively
advanced and rarely used tool, we will omit further details in this
introductory text."
Working in a language with accented characters is definitely bad
luck.

Freundliche Grüsse

Nicolas Riesch
Jul 18 '05 #3
Hi,
> Since I work for an insurance company, most of our data contains French
> accented characters.
> So we are condemned to work essentially with unicode strings.
> In fact, it is hard to find examples where plain ASCII strings would
> be useful in our case.
> Even the data we retrieve from databases is returned to us as unicode
> strings.

This statement looks as if you confuse UTF-8 and Unicode. They are not the
same: the former is an encoding of the latter.
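
To make the distinction concrete, here is a small sketch (Python 3 syntax, not the Python 2 discussed in this thread). The same Unicode text is turned into two different byte sequences by two encodings, and both decode back to the identical text. The variable names are only illustrative.

```python
text = "pr\u00e9nom"                     # "prénom": six Unicode code points

utf8_bytes = text.encode("utf-8")        # é becomes two bytes
latin1_bytes = text.encode("latin-1")    # é becomes one byte

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Different bytes on disk, same text once decoded with the right codec.
print(utf8_bytes == latin1_bytes)
print(utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1"))
```

Data that a database driver hands back "as unicode" is the text itself; UTF-8 only comes into play when that text is written to a file, a socket or a terminal.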
> A way of making all string literals unicode by default would have been a
> relief.

I can understand that wish, but it would certainly break too much existing
third-party code. But I wonder if a tool such as pychecker could be enhanced to
issue warnings on Python code if string literals are not prefixed by a u.
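
Such a check could look roughly like the sketch below. This is not pychecker and does not use its API; it relies only on the standard tokenize module, and the function name unprefixed_string_lines is made up for the illustration. It is written for Python 3, where the check no longer changes meaning, so it only shows the idea: report every string literal whose token starts directly with a quote character.

```python
import io
import tokenize


def unprefixed_string_lines(source):
    """Return the line numbers of string literals that carry no prefix."""
    hits = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        # A STRING token's text begins with its prefix letters, if any,
        # so a token that begins with a quote has no prefix at all.
        if tok.type == tokenize.STRING and tok.string[0] in "'\"":
            hits.append(tok.start[0])
    return hits


sample = 'a = u"nom"\nb = "nom"\nc = b"raw"\n'
print(unprefixed_string_lines(sample))
```

A real checker would also need a policy for docstrings and for literals that are deliberately byte strings, both of which this sketch flags or ignores purely by prefix.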
--
Regards,

Diez B. Roggisch
Jul 18 '05 #4
nico wrote:
> I think you have the same problem when you write strings in German.

I try to avoid putting non-English messages into source code (Python
or not). Instead, I often put English into source code, then use
gettext to fetch translations.
> Sometimes I feel like an ethnic minority when I see in a
> well-known book about Python that "Because Unicode is a relatively
> advanced and rarely used tool, we will omit further details in this
> introductory text."

I do use Unicode strings a lot in my Python applications. However,
I rarely use them in string literals. Whenever I had to put accented/umlauted
characters into a Unicode literal, I had no problem putting u""
in front of the literal.

If you really need a way of declaring all string literals as Unicode,
on a per-module basis, then

from __future__ import string_literals_are_unicode

is an appropriate way of doing this. Of course, it does not work in
the current versions (including 2.4); it doesn't work because nobody
has contributed code to make it work.

So if you really need the feature, please implement it, and submit
the change to sf.net/projects/python. There is nothing wrong with
such a feature - just that nobody has implemented it.
This is how open source works.
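
For later readers: a directive of this kind did eventually ship, under the name unicode_literals rather than the one suggested above (Python 2.6 and later). The sketch below shows its effect. The module text is kept in a string only so that the __future__ line can sit at the top of its own compilation unit, which such imports require; on Python 3 the import is accepted and changes nothing, on Python 2.6 or 2.7 the reported type would be unicode.

```python
# Source of a tiny module, compiled separately so that the __future__
# import is the first statement of its compilation unit.
module_source = (
    "from __future__ import unicode_literals\n"
    "s = 'nom'\n"
    "kind = type(s).__name__\n"
)

namespace = {}
exec(compile(module_source, "<demo>", "exec"), namespace)

print(namespace["kind"])
```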

Regards,
Martin
Jul 18 '05 #5
Martin v. Löwis wrote:
> If you really need a way of declaring all string literals as Unicode,
> on a per-module basis, then
>
> from __future__ import string_literals_are_unicode

Were it to be done, would that also introduce new syntax for
generating a byte string?

Perhaps b"" as in

s = b"\N{LATIN"

?
Andrew
da***@dalkescientific.com
Jul 18 '05 #6
In article <kx******************@newsread3.news.pas.earthlink .net>,
Andrew Dalke <ad****@mindspring.com> wrote:
> Martin v. Löwis wrote:
>> If you really need a way of declaring all string literals as Unicode,
>> on a per-module basis, then
>>
>> from __future__ import string_literals_are_unicode
>
> Were it to be done, would that also introduce new syntax for
> generating a byte string?
>
> Perhaps b"" as in
>
> s = b"\N{LATIN"
>
> ?


IMO we should plan to move towards the following:

- all string literals should become unicode
- there should be a bytes() type for binary strings
- there should be a way to use byte string literals.
  b"..." seems a good candidate.

I doubt this can be done without breaking stuff (although a __future__
directive may make it possible), so maybe this is a 3.0 project.
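
The three points above describe the split that Python 3 later adopted. As a sketch in Python 3 syntax (not something that runs on the Python 2.x versions under discussion): unprefixed literals are text, b"..." literals are bytes, and converting between the two always names an encoding.

```python
text = "h\u00e9l\u00e8ne"      # "hélène" as str: a sequence of code points
data = b"\x00\xffraw"          # bytes: a sequence of integers in 0..255

print(type(text).__name__, type(data).__name__)
print(data[1])                 # indexing bytes yields an int

# Crossing between the two types requires an explicit encoding.
print(bytes(text, "utf-8") == text.encode("utf-8"))
```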

There already is a PEP for a bytes type:
http://www.python.org/peps/pep-0296.html
...but it seems it's been dormant since 2002. Time to revive it?

Just
Jul 18 '05 #7
