ensuring valid latin-1

Hey all,

I'm trying to write something that will "fail fast" if one of my users
gives me non-latin-1 characters. So I tried this:
>>> testString = "\x80"
>>> foo = unicode(testString, "latin-1")
>>> foo
u'\x80'

I would have thought that that should have raised an error, because
\x80 is not a valid character in latin-1 (according to what I can
find). Is this the expected behavior, or am I missing something?

I'm on Windows, but I have explicitly set the character set to be
latin-1 in sitecustomize.py
>>> import sys
>>> sys.getdefaultencoding()
'latin-1'

Nov 29 '06 #1
1 Reply


Chris Curvey wrote:
> Hey all,
>
> I'm trying to write something that will "fail fast" if one of my users
> gives me non-latin-1 characters. So I tried this:
> >>> testString = "\x80"
> >>> foo = unicode(testString, "latin-1")
> >>> foo
> u'\x80'
>
> I would have thought that that should have raised an error, because
> \x80 is not a valid character in latin-1 (according to what I can
> find). Is this the expected behavior, or am I missing something?

Depends on what you call 'latin-1'. The standard ISO 8859-1 defined
only displayable characters. If you used that definition, even the
basic ASCII carriage return, line feed and tab would raise an error.
However, according to Wikipedia:

"""In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the extra
hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the
Internet. This map assigns the C0 and C1 control characters to the code
values 00-1F, 7F, and 80-9F. It thus provides for 256 characters
via every possible 8-bit value."""

'latin-1' and 'iso-8859-1' are the same encoding.
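
A quick way to see both points, sketched in Python 3 syntax (the thread itself uses Python 2, where unicode(s, "latin-1") plays the role of bytes.decode). The variable names are only illustrative:

```python
import codecs

# Both spellings resolve to the same codec object name.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# Every possible byte value decodes, and each byte maps to the code point
# with the same number, so no input can ever raise a decode error.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
maps_one_to_one = [ord(ch) for ch in decoded] == list(range(256))
```

This is why decoding "\x80" succeeds: it becomes the C1 control character U+0080 rather than an error.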

If you articulate your definition of "valid latin-1", we should be able
to help you with some Python code to check it for you.
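
As one possible starting point, here is a minimal sketch in Python 3 syntax. It assumes a definition the original poster never stated: "valid" means the graphic characters of ISO 8859-1 (0x20-0x7E and 0xA0-0xFF) plus tab, line feed and carriage return. The function names and the allowed-control set are hypothetical and should be adjusted to whatever definition is actually wanted:

```python
# Assumption: these three C0 controls are acceptable in user input.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def first_invalid_latin1(data):
    """Return the offset of the first disallowed byte, or None if all pass."""
    for index, byte in enumerate(data):  # iterating bytes yields ints
        graphic = 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF
        if not graphic and byte not in ALLOWED_CONTROLS:
            return index
    return None


def require_latin1(data):
    """Fail fast: raise ValueError on the first disallowed byte, else decode."""
    bad = first_invalid_latin1(data)
    if bad is not None:
        raise ValueError(
            "byte 0x%02X at offset %d is not allowed" % (data[bad], bad)
        )
    return data.decode("latin-1")
```

Under this definition require_latin1(b"\x80") raises ValueError, while ordinary accented text such as b"caf\xe9" decodes normally.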
> I'm on Windows, but I have explicitly set the character set to be
> latin-1 in sitecustomize.py

Why??

Don't do that. That's a self-inflicted double whammy.
(1) You should *not* assume that all the legacy str data your machine
will ever process is in only one encoding.
(2) On a Windows machine, your legacy data is extremely likely to be
encoded in a Microsoft-developed encoding (like cp1252), not latin-1.
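
To illustrate point (2), a small sketch in Python 3 syntax showing how the same byte reads under the two encodings. The variable names are only illustrative:

```python
euro_byte = b"\x80"

as_latin1 = euro_byte.decode("latin-1")  # C1 control character U+0080
as_cp1252 = euro_byte.decode("cp1252")   # euro sign U+20AC

# Python's cp1252 codec leaves a few byte values unassigned (0x81 is one),
# so, unlike latin-1, it can reject input outright.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True
```

So a byte like \x80 arriving from a Windows user is far more likely to be a euro sign typed in cp1252 than a control character in latin-1.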
> >>> import sys
> >>> sys.getdefaultencoding()
> 'latin-1'

HTH,
John

Nov 29 '06 #2
