Becoming Unicode Aware

Michael Foord

I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
native speaker of ascii will never learn to speak unicode like a
native'. The trouble is I think I've been a native speaker of latin-1
without realising it.

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them?

Anyway - ConfigObj reads config files from plain text files. Is there
a standard for specifying the encoding within the text file? I know
Python scripts have a method - should I just use that?

Also - suppose I know the encoding, or let the programmer specify it, is
the following sufficient for reading the files in:

    def afunction(setoflines, encoding='ascii'):
        for line in setoflines:
            if encoding:
                line = line.decode(encoding)

Regards,
Fuzzy
http://www.voidspace.org.uk/atlantib...thonutils.html

Jul 18 '05 #1

Diez B. Roggisch

> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them?

Unfortunately the HTTP standard seems to lack a specification of how form
data encoding is to be transferred. But it seems that most browsers which
understand the encoding your page is delivered in will use that for
replying.

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file? I know
> python scripts have a method - should I just use that?

No idea what ConfigObj is - is it your own config parser?

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in:
>
>     def afunction(setoflines, encoding='ascii'):
>         for line in setoflines:
>             if encoding:
>                 line = line.decode(encoding)

Yes, it should be - but why the if? It is unnecessary, as its condition will
always be true - and you _want_ it that way, as the result of afunction
should always be unicode objects, no matter what encoding was used.
--
Regards,

Diez B. Roggisch

Jul 18 '05 #2

Jim Hefferon

fu******@gmail.com (Michael Foord) wrote ...

> I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
> native speaker of ascii will never learn to speak unicode like a
> native'. The trouble is I think I've been a native speaker of latin-1
> without realising it.

It *is* odd, IMHO, that my database connector spits out string-like
things that have 8-bit data so that when I
"".join(array_of_database_strings) them, I get a failure. I've
learned to convert them by hand into unicode strings, but it is odd.
Something like a pair (encoding, string) seems more natural to me, but
probably I just don't get the issues.
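
The join failure described above can be illustrated with a minimal sketch. It assumes Python 3 (where 8-bit strings are `bytes` and decoded text is `str`, rather than the `str`/`unicode` split this thread refers to) and assumes the encoding of the database values is known out of band. The `rows` data and the `join_decoded` helper are hypothetical names made up for the illustration, not part of any database connector.

```python
# Hypothetical stand-in for values returned by a database connector:
# raw 8-bit byte strings whose encoding (Latin-1 here) is known separately.
rows = [b"caf\xe9", b" ", b"na\xefve"]


def join_decoded(byte_strings, encoding):
    """Decode every byte string with the given encoding, then join as text."""
    return "".join(chunk.decode(encoding) for chunk in byte_strings)


# Decode first, join second: the result is one text string.
text = join_decoded(rows, "latin-1")
```

Decoding at the boundary (as soon as the bytes arrive) keeps the rest of the program working with text only, which is roughly what carrying an (encoding, string) pair around would achieve.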

> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them?

I found this link
https://bugzilla.mozilla.org/show_bug.cgi?id=18643#c12
useful.

Jim

Jul 18 '05 #3

Just

In article <54**************************@posting.google.com>,
jh*******@smcvt.edu (Jim Hefferon) wrote:

> > My main problem with understanding unicode is what to do with
> > arbitrary text without an encoding specified. To the best of my
> > knowledge the technical term for this situation is 'buggered'. E.g. I
> > have a CGI guestbook script. Is the only way of knowing what encoding
> > the user is typing in, to ask them?
>
> I found this link
> https://bugzilla.mozilla.org/show_bug.cgi?id=18643#c12
> useful.

Likewise, I found this link
http://www.w3schools.com/tags/tag_form.asp
useful. See the accept-charset attribute.

Just

Jul 18 '05 #4

Bengt Richter

On Wed, 27 Oct 2004 12:56:32 +0200, "Diez B. Roggisch" <de*********@web.de> wrote:

> > My main problem with understanding unicode is what to do with
> > arbitrary text without an encoding specified. To the best of my
> > knowledge the technical term for this situation is 'buggered'. E.g. I
> > have a CGI guestbook script. Is the only way of knowing what encoding
> > the user is typing in, to ask them?
>
> Unfortunately the http standard seems to lack a specification how form data
> encoding is to be transferred. But it seems that most browsers which
> understand a certain encoding your page is delivered in will use that for
> replying.
>
> > Anyway - ConfigObj reads config files from plain text files. Is there
> > a standard for specifying the encoding within the text file? I know
> > python scripts have a method - should I just use that?
>
> No idea what configobj is - is it your own config parser?
>
> > Also - suppose I know the encoding, or let the programmer specify, is
> > the following sufficient for reading the files in:
> >
> >     def afunction(setoflines, encoding='ascii'):
> >         for line in setoflines:
> >             if encoding:
> >                 line = line.decode(encoding)
>
> Yes, it should be - but why the if? It is unnecessary, as its condition will
> always be true - and you _want_ it that way, as the result of afunction
  ^^^^^^^^^^^^^^

    afunction(lines, None)

would seem to be a feasible call ;-)

> should always be unicode objects, no matter what encoding was used.

Regards,
Bengt Richter

Jul 18 '05 #5

Alex Martelli

Michael Foord <fu******@gmail.com> wrote:
...

>     def afunction(setoflines, encoding='ascii'):
>         for line in setoflines:
>             if encoding:
>                 line = line.decode(encoding)

This snippet as posted is a complicated "no-op but raise an error for
invalidly encoded lines", if it's the whole function.

Assuming the so-called setoflines IS not a set but a list (order
normally matters in such cases), you may rather want:

    def afunction(setoflines, encoding='ascii'):
        for i, line in enumerate(setoflines):
            setoflines[i] = line.decode(encoding)

The removal of the 'if' is just the same advice you were already given;
if you want to be able to explicitly pass encoding='' to AVOID the
decode (the whole purpose of the function), just insert a first line

    if not encoding: return

rather than repeating the test in the loop. But the key change is to
use enumerate to get indices as well as values, and assign into the
indexing in order to update 'setoflines' in-place; assigning to the
local variable 'line' (assuming, again, that you didn't snip your code
w/o a mention of that) is no good.

A good alternative might be

    setoflines[:] = [line.decode(encoding) for line in setoflines]

assuming again that you want the change to happen in-place.
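
The in-place variant with the early return can be put together as a small runnable sketch. It assumes Python 3, where the undecoded lines are `bytes` and the decoded ones are `str`; the name `decode_in_place` and the sample data are hypothetical, chosen only for this illustration.

```python
def decode_in_place(lines, encoding="ascii"):
    """Replace each bytes item of the list `lines` with its decoded text."""
    if not encoding:
        return  # an empty or None encoding means: leave the list untouched
    # Slice assignment mutates the caller's list instead of rebinding a local.
    lines[:] = [line.decode(encoding) for line in lines]


raw = [b"key = value\n", b"name = Fuzzy\n"]
decode_in_place(raw, "ascii")

untouched = [b"key = value\n"]
decode_in_place(untouched, None)
```

After the calls, `raw` holds text strings while `untouched` still holds the original bytes, which is the difference between updating the list and merely assigning to the loop variable.
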
Alex

Jul 18 '05 #6

Carl Banks

fu******@gmail.com (Michael Foord) wrote in message news:<6f**************************@posting.google.com>...

> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them?

Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.
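
One simple form such a heuristic guess could take is trying a short list of candidate encodings in order and keeping the first that decodes without error. This is only a sketch, assuming Python 3; `guess_decode` and its default candidate list are hypothetical, and a clean decode does not prove the guess is right (Latin-1 in particular accepts any byte sequence, so it only makes sense as the last resort).

```python
def guess_decode(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings fit the data")


plain = guess_decode(b"hello")
utf8 = guess_decode("caf\u00e9".encode("utf-8"))
fallback = guess_decode(b"caf\xe9")  # not valid ASCII or UTF-8
```

The order of the candidates matters: stricter encodings go first, because the permissive ones would otherwise win for every input.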

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file? I know
> python scripts have a method - should I just use that?

It's a good method if you expect people to be editing the config file
with Emacs. It's a good enough method if you haven't any good reason
to use another method.
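
Borrowing the Python-script method for a config file might look like the sketch below. It assumes Python 3, assumes the declaration sits in a comment on one of the first two lines (as it does for Python source files), and uses a simplified pattern modelled on the one Python applies to source files. `declared_encoding` and `read_config` are hypothetical helpers, not ConfigObj functions.

```python
import re

# Matches comment lines such as "# -*- coding: latin-1 -*-" or "# coding=utf-8".
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(raw, default="ascii"):
    """Look for a coding declaration in the first two lines of `raw` bytes."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


def read_config(raw):
    """Decode the whole file with its declared encoding and split into lines."""
    return raw.decode(declared_encoding(raw)).splitlines()


sample = b"# -*- coding: latin-1 -*-\nname = caf\xe9\n"
config_lines = read_config(sample)
```

This only works when the file's encoding is ASCII-compatible enough for the declaration itself to be readable as bytes, which rules out encodings such as UTF-16.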

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in:
>
>     def afunction(setoflines, encoding='ascii'):
>         for line in setoflines:
>             if encoding:
>                 line = line.decode(encoding)

For most encodings, this'll work fine. But there are some encodings,
for example UTF-16, that won't work with it. UTF-16 fails for two
reasons: the two-byte characters interfere with the line buffering,
and UTF-16 strings must be preceded by a two-byte code indicating
endianness, which would be at the beginning of the file but not of
each line.

Fortunately, most text files aren't in UTF-16. I mention this so that
you are aware that, although afunction works in most cases, it is not
universal.

I believe it's the purpose of the StreamReader and StreamWriter
classes in the codecs module to deal with such situations.
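
A small sketch of the stream-reader approach for UTF-16, assuming Python 3. An in-memory `io.BytesIO` object stands in for a file opened in binary mode, and the sample text is invented for the illustration; `codecs.getreader` is the standard-library call that returns the StreamReader class for a named encoding.

```python
import codecs
import io

# UTF-16 bytes as they might sit in a file: one BOM up front, then two lines.
data = "first line\nsecond line\n".encode("utf-16")

# Splitting the raw bytes on b"\n" would cut two-byte code units apart, and
# only the first piece would carry the BOM, so decoding line by line is
# unreliable. A stream reader decodes the byte stream as a whole and only
# then hands back lines of text.
reader = codecs.getreader("utf-16")(io.BytesIO(data))
lines = list(reader)
```

For real files, opening with an explicit encoding (for example `open(path, encoding="utf-16")`) gives the same effect with less ceremony.
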
--
CARL BANKS

Jul 18 '05 #7

Diez B. Roggisch

>     afunction(lines, None)
>
> would seem to be a feasible call ;-)

Ok, I admit that I didn't think of _that_ stupid possibility :)
Nevertheless: he wants unicode objects, so he should make sure he gets
them....
--
Regards,

Diez B. Roggisch

Jul 18 '05 #8

Michael Foord

al*****@yahoo.com (Alex Martelli) wrote in message news:<1gmd10n.1xt7l7q6ahyacN%al*****@yahoo.com>...

> Michael Foord <fu******@gmail.com> wrote:
> ...
> >     def afunction(setoflines, encoding='ascii'):
> >         for line in setoflines:
> >             if encoding:
> >                 line = line.decode(encoding)
>
> This snippet as posted is a complicated "no-op but raise an error for
> invalidly encoded lines", if it's the whole function.

:-)
It wouldn't be the whole function...... glad you credit me with
some intelligence ;-)

> Assuming the so-called setoflines IS not a set but a list (order
> normally matters in such cases), you may rather want:
>
>     def afunction(setoflines, encoding='ascii'):
>         for i, line in enumerate(setoflines):
>             setoflines[i] = line.decode(encoding)
>
> The removal of the 'if' is just the same advice you were already given;
> if you want to be able to explicitly pass encoding='' to AVOID the
> decode (the whole purpose of the function), just insert a first line
>
>     if not encoding: return
>
> rather than repeating the test in the loop. But the key change is to
> use enumerate to get indices as well as values, and assign into the
> indexing in order to update 'setoflines' in-place; assigning to the
> local variable 'line' (assuming, again, that you didn't snip your code
> w/o a mention of that) is no good.

The rest of the function (which I didn't show) would actually process
the lines one by one......

Regards,
Fuzzy
http://www.voidspace.org.uk/atlantib...thonutils.html

> A good alternative might be
>
>     setoflines[:] = [line.decode(encoding) for line in setoflines]
>
> assuming again that you want the change to happen in-place.
> Alex

Jul 18 '05 #9

Egil M?ller

> Unfortunately the http standard seems to lack a specification how form data
> encoding is to be transferred. But it seems that most browsers which
> understand a certain encoding your page is delivered in will use that for
> replying.

Fortunately, this is utter bullshit :)

Send the Content-Type HTTP header to the client, with the value
"text/html; charset=UTF-8". You may have to send it both as an HTTP
header and as a meta http-equiv HTML tag to get it to work with all
browsers though. Usually (I don't know if it is really in the standard
that the client has to behave this way), the client will reply in the
same encoding as you sent your page with the form. Anyway, the client
will probably set a similar tag upon reply, but I don't know about that,
and don't care, as just expecting the same encoding works for all major
browsers (Mozilla, IE, Opera).
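
For a guestbook script, the approach described above (declare the page's charset, then assume the reply uses the same one) could be sketched as follows. This assumes Python 3 and a URL-encoded form body. `PAGE_CHARSET`, `content_type_header` and `decode_form` are hypothetical names for the illustration; `urllib.parse.parse_qs` is the standard-library parser, and its `encoding` and `errors` arguments control how percent-escaped bytes are turned into text.

```python
from urllib.parse import parse_qs

PAGE_CHARSET = "utf-8"  # assumption: the form page is served with this charset


def content_type_header(charset=PAGE_CHARSET):
    """Header line to send with the page that contains the form."""
    return "Content-Type: text/html; charset=%s" % charset


def decode_form(query_string, charset=PAGE_CHARSET):
    """Parse a URL-encoded form body, decoding percent-escapes with `charset`."""
    # errors="replace" keeps the script alive if a client replies in some
    # other encoding; the damaged characters show up as U+FFFD instead.
    return parse_qs(query_string, encoding=charset, errors="replace")


fields = decode_form("name=Egil&comment=sm%C3%B6rg%C3%A5sbord")
```

Each value in the result is a list, because a form field may appear more than once in a submission.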

Jul 18 '05 #10
