Becoming Unicode Aware

I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
native speaker of ascii will never learn to speak unicode like a
native'. The trouble is I think I've been a native speaker of latin-1
without realising it.

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them ?

Anyway - ConfigObj reads config files from plain text files. Is there
a standard for specifying the encoding within the text file ? I know
python scripts have a method - should I just use that ?

Also - suppose I know the encoding, or let the programmer specify, is
the following sufficient for reading the files in :

def afunction(setoflines, encoding='ascii'):
    for line in setoflines:
        if encoding:
            line = line.decode(encoding)

Regards,
Fuzzy
http://www.voidspace.org.uk/atlantib...thonutils.html
Jul 18 '05 #1
> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them ?

Unfortunately the HTTP standard seems to lack a specification of how form
data encoding is to be transferred. But it seems that most browsers that
understand the encoding your page is delivered in will use that encoding
for replying.

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file ? I know
> python scripts have a method - should I just use that ?

No idea what ConfigObj is - is it your own config parser?

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in :
>
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)


Yes, it should be - but why the if? It is unnecessary, as its condition will
always be true - and you _want_ it that way, as the result of afunction
should always be unicode objects, no matter what encoding was used.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #2
fu******@gmail.com (Michael Foord) wrote ...
> I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
> native speaker of ascii will never learn to speak unicode like a
> native'. The trouble is I think I've been a native speaker of latin-1
> without realising it.

It *is* odd, IMHO, that my database connector spits out string-like
things that have 8-bit data so that when I
"".join(array_of_database_strings) them, I get a failure. I've
learned to convert them by hand into unicode strings, but it is odd.
Something like a pair (encoding, string) seems more natural to me, but
probably I just don't get the issues.

> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them ?

I found this link
https://bugzilla.mozilla.org/show_bug.cgi?id=18643#c12
useful.

Jim
Jul 18 '05 #3
In article <54**************************@posting.google.com>,
jh*******@smcvt.edu (Jim Hefferon) wrote:
> > My main problem with understanding unicode is what to do with
> > arbitrary text without an encoding specified. To the best of my
> > knowledge the technical term for this situation is 'buggered'. E.g. I
> > have a CGI guestbook script. Is the only way of knowing what encoding
> > the user is typing in, to ask them ?
>
> I found this link
> https://bugzilla.mozilla.org/show_bug.cgi?id=18643#c12
> useful.

Likewise, I found this link
http://www.w3schools.com/tags/tag_form.asp
useful. See the accept-charset attribute.

Just
Jul 18 '05 #4
On Wed, 27 Oct 2004 12:56:32 +0200, "Diez B. Roggisch" <de*********@web.de> wrote:
> > My main problem with understanding unicode is what to do with
> > arbitrary text without an encoding specified. To the best of my
> > knowledge the technical term for this situation is 'buggered'. E.g. I
> > have a CGI guestbook script. Is the only way of knowing what encoding
> > the user is typing in, to ask them ?
>
> Unfortunately the http standard seems to lack a specification how form data
> encoding is to be transferred. But it seems that most browsers which
> understand a certain encoding your page is delivered in will use that for
> replying.
>
> > Anyway - ConfigObj reads config files from plain text files. Is there
> > a standard for specifying the encoding within the text file ? I know
> > python scripts have a method - should I just use that ?
>
> No idea what configobj is - is it your own config parser?
>
> > Also - suppose I know the encoding, or let the programmer specify, is
> > the following sufficient for reading the files in :
> >
> > def afunction(setoflines, encoding='ascii'):
> >     for line in setoflines:
> >         if encoding:
> >             line = line.decode(encoding)
>
> Yes, it should be - but why the if? It is unnecessary, as its condition will
> always be true - and you _want_ it that way, as the result of afunction
  ^^^^^^^^^^^^^^

afunction(lines, None)

would seem to be a feasible call ;-)

> should always be unicode objects, no matter what encoding was used.


Regards,
Bengt Richter
Jul 18 '05 #5
Michael Foord <fu******@gmail.com> wrote:
> ...
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)


This snippet as posted is a complicated "no-op but raise an error for
invalidly encoded lines", if it's the whole function.

Assuming the so-called setoflines IS not a set but a list (order
normally matters in such cases), you may rather want:

def afunction(setoflines, encoding='ascii'):
    for i, line in enumerate(setoflines):
        setoflines[i] = line.decode(encoding)

The removal of the 'if' is just the same advice you were already given;
if you want to be able to explicitly pass encoding='' to AVOID the
decode (the whole purpose of the function), just insert a first line

    if not encoding: return

rather than repeating the test in the loop. But the key change is to
use enumerate to get indices as well as values, and assign into the
indexing in order to update 'setoflines' in-place; assigning to the
local variable 'line' (assuming, again, that you didn't snip your code
w/o a mention of that) is no good.

A good alternative might be

    setoflines[:] = [line.decode(encoding) for line in setoflines]

assuming again that you want the change to happen in-place.
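
A minimal sketch of the difference, assuming Python 3 (where the byte strings discussed in this thread are `bytes` and decoded text is `str`). The function names are made up for illustration and are not part of ConfigObj:

```python
# Python 3 sketch: raw lines are `bytes`, decoded lines are `str`.
def rebinding_does_nothing(lines, encoding="ascii"):
    """The pattern from the question: only the local name `line` is rebound."""
    for line in lines:
        line = line.decode(encoding)  # result is discarded on the next iteration


def decode_lines_in_place(lines, encoding="ascii"):
    """Replace every bytes item of `lines` with its decoded text, in place."""
    if not encoding:  # empty string or None: leave the list untouched
        return
    lines[:] = [line.decode(encoding) for line in lines]


raw = [b"name = value\n", b"other = thing\n"]

rebinding_does_nothing(raw)
still_bytes = all(isinstance(item, bytes) for item in raw)

decode_lines_in_place(raw)
now_text = all(isinstance(item, str) for item in raw)
```

The slice assignment mutates the caller's list, so no return value is needed. Returning a new list instead would be the safer choice if callers should keep the raw bytes.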
Alex
Jul 18 '05 #6
fu******@gmail.com (Michael Foord) wrote in message news:<6f**************************@posting.google.com>...
> My main problem with understanding unicode is what to do with
> arbitrary text without an encoding specified. To the best of my
> knowledge the technical term for this situation is 'buggered'. E.g. I
> have a CGI guestbook script. Is the only way of knowing what encoding
> the user is typing in, to ask them ?

Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.
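
One crude form such a guess could take, sketched in Python 3: try a short list of candidate encodings in order and keep the first that decodes without error. The function name and the candidate list are assumptions for illustration. A clean decode does not prove the guess is right, and latin-1 accepts any byte sequence, so it only makes sense as a last resort.

```python
def guess_decode(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
    raise ValueError("none of the candidate encodings fit the data")


# The same word encoded two ways ("caf\u00e9" is "cafe" with an acute accent).
text_a, used_a = guess_decode("caf\u00e9".encode("utf-8"))
text_b, used_b = guess_decode("caf\u00e9".encode("latin-1"))
```

Putting UTF-8 first is deliberate: text in a legacy 8-bit encoding is rarely valid UTF-8 by accident, whereas the reverse check would accept almost anything.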

> Anyway - ConfigObj reads config files from plain text files. Is there
> a standard for specifying the encoding within the text file ? I know
> python scripts have a method - should I just use that ?

It's a good method if you expect people to be editing the config file
with Emacs. It's a good enough method if you haven't any good reason
to use another method.
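
A rough sketch of what borrowing that method could look like for a config file, in Python 3. It assumes the file uses `#` comments and carries a coding comment in one of its first two lines, as Python source files do; the regular expression is modeled on the one given in PEP 263, and `declared_encoding` is a hypothetical helper, not something ConfigObj is known to provide.

```python
import re

# Modeled on the pattern PEP 263 gives for Python source files.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(raw_lines, default="ascii"):
    """Look for a coding comment in the first two lines (given as bytes)."""
    for line in raw_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


raw_lines = [b"# -*- coding: latin-1 -*-\n", b"name = Andr\xe9\n"]
encoding = declared_encoding(raw_lines)
decoded = [line.decode(encoding) for line in raw_lines]
```

The file has to be read as bytes first, since the declaration must be found before anything can be decoded.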

> Also - suppose I know the encoding, or let the programmer specify, is
> the following sufficient for reading the files in :
>
> def afunction(setoflines, encoding='ascii'):
>     for line in setoflines:
>         if encoding:
>             line = line.decode(encoding)


For most encodings, this'll work fine. But there are some encodings,
for example UTF-16, that won't work with it. UTF-16 fails for two
reasons: the two-byte characters interfere with the line buffering,
and UTF-16 strings must be preceded by a two-byte code indicating
endianness, which would be at the beginning of the file but not of
each line.

Fortunately, most text files aren't in UTF-16. I mention this so that
you are aware that, although afunction works in most cases, it is not
universal.

I believe it's the purpose of the StreamReader and StreamWriter
classes in the codecs module to deal with such situations.
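
A small Python 3 sketch of both points, using `codecs.getreader` to obtain the StreamReader class for UTF-16. Here `io.BytesIO` stands in for a file opened in binary mode, and the sample text is invented.

```python
import codecs
import io

text = "first line\nsecond line\n"
data = text.encode("utf-16")  # BOM, then two bytes per character for this text

# Splitting the raw bytes on b"\n" cuts a two-byte code unit in half,
# and every piece after the first has no BOM in front of it.
pieces = data.split(b"\n")

# Wrapping the byte stream in a StreamReader decodes it as a whole,
# so the BOM is consumed once and line breaks are found in decoded text.
reader = codecs.getreader("utf-16")(io.BytesIO(data))
lines = reader.readlines()
```

For a file on disk, opening it with an explicit encoding (for example `open(path, encoding="utf-16")` in Python 3) gives the same effect without wrapping the stream by hand.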
--
CARL BANKS
Jul 18 '05 #7
> afunction(lines, None)
>
> would seem to be a feasible call ;-)


Ok, I admit that I didn't think of _that_ stupid possibility :)
Nevertheless: he wants unicode objects, so he should make sure he gets
them....
--
Regards,

Diez B. Roggisch
Jul 18 '05 #8
al*****@yahoo.com (Alex Martelli) wrote in message news:<1gmd10n.1xt7l7q6ahyacN%al*****@yahoo.com>...
> Michael Foord <fu******@gmail.com> wrote:
> ...
> > def afunction(setoflines, encoding='ascii'):
> >     for line in setoflines:
> >         if encoding:
> >             line = line.decode(encoding)
>
> This snippet as posted is a complicated "no-op but raise an error for
> invalidly encoded lines", if it's the whole function.


:-)
It wouldn't be the whole function...... glad you attribute me with
some intelligence ;-)
> Assuming the so-called setoflines IS not a set but a list (order
> normally matters in such cases), you may rather want:
>
> def afunction(setoflines, encoding='ascii'):
>     for i, line in enumerate(setoflines):
>         setoflines[i] = line.decode(encoding)
>
> The removal of the 'if' is just the same advice you were already given;
> if you want to be able to explicitly pass encoding='' to AVOID the
> decode (the whole purpose of the function), just insert a first line
>
>     if not encoding: return
>
> rather than repeating the test in the loop. But the key change is to
> use enumerate to get indices as well as values, and assign into the
> indexing in order to update 'setoflines' in-place; assigning to the
> local variable 'line' (assuming, again, that you didn't snip your code
> w/o a mention of that) is no good.

The rest of the function (which I didn't show) would actually process
the lines one by one......

Regards,
Fuzzy
http://www.voidspace.org.uk/atlantib...thonutils.html

> A good alternative might be
>
>     setoflines[:] = [line.decode(encoding) for line in setoflines]
>
> assuming again that you want the change to happen in-place.
> Alex

Jul 18 '05 #9
> Unfortunately the http standard seems to lack a specification how form data
> encoding is to be transferred. But it seems that most browser which
> understand a certain encoding your page is delivered in will use that for
> replying.


Fortunately, this is utter bullshit :)

Send the Content-Type HTTP header to the client, with the value
"text/html; charset=UTF-8". You may have to send it both as an HTTP
header and as a meta http-equiv tag in the HTML to get it to work with
all browsers though. Usually (I don't know if it is really in the
standard that the client has to behave this way), the client will reply
in the same encoding as you sent your page with the form. Anyway, the
client will probably set a similar header upon reply, but I don't know
about that, and don't care, as just expecting the same encoding works
for all major browsers (mozilla, IE, opera).
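
A sketch of that approach in Python 3, assuming the browser does reply in the page's encoding as described above. It is not a complete CGI script: `PAGE_ENCODING`, `response_headers` and `parse_form` are hypothetical names, and the form body is simulated with `urllib.parse.urlencode` rather than read from a real request.

```python
from urllib.parse import parse_qs, urlencode

PAGE_ENCODING = "utf-8"  # assumption: the form page is served as UTF-8


def response_headers():
    """Header block that tells the browser which encoding the page uses."""
    return "Content-Type: text/html; charset=%s\r\n\r\n" % PAGE_ENCODING.upper()


def parse_form(body):
    """Decode a urlencoded form body, assuming it uses the page encoding."""
    return parse_qs(body, encoding=PAGE_ENCODING, errors="strict")


# Simulate what a browser would submit for a guestbook entry.
body = urlencode(
    {"name": "Zo\u00eb", "message": "gr\u00fc\u00dfe"}, encoding=PAGE_ENCODING
)
fields = parse_form(body)
```

With `errors="strict"` a submission in some other encoding raises an exception instead of silently producing wrong characters, which makes the failure visible.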
Jul 18 '05 #10
