Chris Curvey wrote:
Hey all,
I'm trying to write something that will "fail fast" if one of my users
gives me non-latin-1 characters. So I tried this:
>>> testString = "\x80"
>>> foo = unicode(testString, "latin-1")
>>> foo
u'\x80'
I would have thought that that should have raised an error, because
\x80 is not a valid character in latin-1 (according to what I can
find). Is this the expected behavior, or am I missing something?
Depends on what you call 'latin-1'. The ISO 8859-1 standard itself defines
only the displayable characters (code values 20-7E and A0-FF). If you used
that definition, even the basic ASCII carriage return, line feed and tab
would raise an error.
However, according to wikipedia:
"""In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the extra
hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the
Internet. This map assigns the C0 and C1 control characters to the code
values 00-1F, 7F, and 80-9F. It thus provides for 256 characters
via every possible 8-bit value."""
'latin-1' and 'iso-8859-1' are the same encoding.
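A quick way to see both points, as a rough sketch in Python 3 syntax (in Python 2, as in your session, you would call unicode(s, "latin-1") on a str instead of bytes.decode). The variable names are mine:

```python
import codecs

# Both spellings resolve to the same codec name in Python's registry.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# Decode every possible byte value. latin-1 maps byte N to code point N,
# so nothing in the range 0x00-0xFF can raise UnicodeDecodeError.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode("latin-1")
maps_one_to_one = [ord(ch) for ch in decoded] == list(range(256))

print(same_codec, maps_one_to_one)
```

That is why unicode("\x80", "latin-1") succeeds: as far as the codec is concerned, there is no such thing as an invalid byte.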
If you articulate your definition of "valid latin-1", we should be able
to help you with some Python code to check it for you.
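For instance, if "valid latin-1" meant "only the displayable characters, plus tab, line feed and carriage return", a fail-fast check might look like the sketch below. That definition is my assumption, not something you have stated, and check_strict_latin1 and ALLOWED_CONTROLS are made-up names:

```python
# Assumed definition: control codes are invalid, except these three.
ALLOWED_CONTROLS = frozenset([0x09, 0x0A, 0x0D])  # tab, line feed, carriage return


def check_strict_latin1(data):
    """Decode data as latin-1, raising ValueError on disallowed control bytes."""
    # bytearray yields integers on both Python 2 and Python 3.
    for offset, byte in enumerate(bytearray(data)):
        # C0 controls are 0x00-0x1F, DEL is 0x7F, C1 controls are 0x80-0x9F.
        is_control = byte < 0x20 or 0x7F <= byte <= 0x9F
        if is_control and byte not in ALLOWED_CONTROLS:
            raise ValueError(
                "byte 0x%02X at offset %d is not allowed" % (byte, offset)
            )
    return data.decode("latin-1")
```

With this, check_strict_latin1(b"\x80") raises ValueError, while b"caf\xe9\n" decodes normally. If your definition differs, adjust the range test accordingly.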
>
I'm on Windows, but I have explicitly set the character set to be
latin-1 in sitecustomize.py
Why??
Don't do that. That's a self-inflicted double whammy.
(1) You should *not* assume that all the legacy str data your machine
will ever process is in only one encoding.
(2) On a Windows machine, your legacy data is extremely likely to be
encoded in a Microsoft-developed encoding (like cp1252), not latin-1.
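To illustrate point (2): the byte from your example means different things in the two encodings. A small sketch in Python 3 syntax, using Python's built-in cp1252 codec; the variable names are mine:

```python
raw = b"\x80"

# latin-1: byte 0x80 becomes U+0080, an invisible C1 control character.
as_latin1 = raw.decode("latin-1")

# cp1252: the same byte is the euro sign, U+20AC.
as_cp1252 = raw.decode("cp1252")

# Python's cp1252 codec also leaves a few bytes (0x81, for one) unassigned,
# so decoding them fails instead of passing silently.
try:
    b"\x81".decode("cp1252")
    undefined_byte_rejected = False
except UnicodeDecodeError:
    undefined_byte_rejected = True

print(repr(as_latin1), repr(as_cp1252), undefined_byte_rejected)
```

So if the data really came from a Windows application, decoding it as latin-1 would quietly turn a euro sign into a control character.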
>
>>> import sys
>>> sys.getdefaultencoding()
'latin-1'
HTH,
John