Bytes IT Community

Best ways of managing text encodings in source/regexes?

Hi

I've read around quite a bit about Unicode and python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)

* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because their source contains no non-ASCII chars.

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?
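Concretely, the setup I have in mind looks something like this (the file name and contents are invented purely for the test):

```python
import codecs
import os
import re
import tempfile

# Write a small UTF-8 test file containing an en-dash; the contents
# are made up for illustration.
tmp = tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False)
tmp.write(u'PAGES 3\u20135'.encode('utf-8'))
tmp.close()

# Read it back as a unicode string via codecs, then match both patterns.
f = codecs.open(tmp.name, 'r', 'utf-8')
text = f.read()
f.close()
os.remove(tmp.name)

myASCIIRegex = re.compile(u'[A-Z]+')
myUniRegex = re.compile(u'\u2013')   # en-dash

assert myASCIIRegex.search(text).group() == u'PAGES'
assert myUniRegex.search(text) is not None
```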

I've been trying to understand this for a while so any clarification
would be a great help.

Tim
Nov 26 '07 #1
4 Replies


* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)
Depends on what you want to achieve. If you don't prefix your strings
with u, they stay byte string objects and don't become Unicode
strings. That is fine for strings that are pure ASCII; for strings
containing ISO-8859-1 chars, it is safer to use only Unicode objects
to represent them.

In Py3k, that will change - string literals will automatically be
Unicode objects.
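To illustrate the distinction, a minimal sketch (the u prefix shown here is also legal again in Python 3.3+, where both literals are text):

```python
# -*- coding: utf-8 -*-
s = "blah"           # a byte string in Python 2 (plain text in Python 3)
u = u"blah\u00e9"    # a Unicode string in both

assert u"blah" == "blah"                      # pure-ASCII literals compare equal
assert u.encode("utf-8") == b"blah\xc3\xa9"   # UTF-8 uses two bytes for U+00E9
assert u.encode("iso-8859-1") == b"blah\xe9"  # ISO-8859-1 uses one
```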
* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)?
Yes, it will. The encoding declaration is per-module.
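One way to see the per-source-unit behaviour without creating two files is compile(), which honours the coding cookie in a bytes source (a sketch, not how imports are normally exercised):

```python
# A bytes source carrying its own coding cookie: compile() honours the
# declaration per source unit, just as the import machinery does.
src = b"# -*- coding: iso-8859-1 -*-\ns = u'caf\xe9'\n"
ns = {}
exec(compile(src, '<module_a>', 'exec'), ns)
assert ns['s'] == u'caf\u00e9'   # \xe9 was decoded as ISO-8859-1
```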
* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to do make all my regex patterns unicode strings, with u""?
It will work fine if the regular expression restricts itself to ASCII
and doesn't rely on any of the locale-specific character classes (such
as \w). If it goes beyond ASCII, or does use such escapes, you had
better make it a Unicode expression.

I'm not actually sure what precisely the semantics are when you match
an expression compiled from a byte string against a Unicode string,
or vice versa. I believe it operates on the internal representation,
so \xf6 in a byte string expression matches \u00f6 in a Unicode
string; it won't try to convert one into the other.
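For what it's worth, Python 3 later sidestepped the question entirely: a pattern compiled from bytes refuses a text subject outright, so the coercion never arises there. A sketch of that behaviour:

```python
import re

bpat = re.compile(b'\xf6')      # pattern compiled from a byte string
upat = re.compile(u'\u00f6')    # pattern compiled from a Unicode string

assert upat.search(u'f\u00f6n') is not None

# In Python 3, a bytes pattern applied to a str subject raises TypeError.
try:
    bpat.search(u'f\u00f6n')
    mixed = True
except TypeError:
    mixed = False
assert mixed is False
```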

Regards,
Martin
Nov 26 '07 #2

Thanks Martin, that's a very helpful response to what I was concerned
might be an overly long query.

Yes, I'd read that in Py3k the distinction between byte strings and
Unicode strings would disappear -- I look forward to that...

Tim
Nov 26 '07 #3

On Nov 26, 2007 4:27 PM, "Martin v. Löwis" <ma****@v.loewis.de> wrote:

It will work fine if the regular expression restricts itself to ASCII,
and doesn't rely on any of the locale-specific character classes (such
as \w). If it's beyond ASCII, or does use such escapes, you better make
it a Unicode expression.
Yes, you have to be careful when writing Unicode-sensitive regular expressions:
http://effbot.org/zone/unicode-objects.htm

"You can apply the same pattern to either 8-bit (encoded) or Unicode
strings. To create a regular expression pattern that uses Unicode
character classes for \w (and \s, and \b), use the "(?u)" flag prefix,
or the re.UNICODE flag:

pattern = re.compile("(?u)pattern")
pattern = re.compile("pattern", re.UNICODE)"
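A small demonstration of the difference; note that in Python 3 the Unicode behaviour became the default for text patterns, with re.ASCII restoring the old one:

```python
import re

text = u'caf\u00e9 2\u20133'

# In Python 3, \w on a str pattern is Unicode-aware by default ...
assert re.findall(u'\\w+', text) == [u'caf\u00e9', u'2', u'3']

# ... and re.ASCII (the opposite of Python 2's re.UNICODE) limits \w
# to [a-zA-Z0-9_], so the accented char is excluded.
assert re.findall(u'\\w+', text, re.ASCII) == [u'caf', u'2', u'3']
```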

Nov 27 '07 #4

tvn
Please see the correction from Cliff, pasted here after this excerpt
from my earlier message:
Tim

> the byte string is ASCII, which is a subset of Unicode (ISO-8859-1 isn't).

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets
of Unicode (which relates to the abstract code points), but ASCII is
also a subset of UTF-8 on the bytestream level, while ISO-8859-1 is
not a subset of UTF-8, nor, as far as I can tell, of any other Unicode
*encoding*.

Thus a file encoded in ASCII *is* in fact a UTF-8 file; there is no
way to distinguish the two. But an ISO-8859-1 file is not the same (on
the bytestream level) as a file with identical content in UTF-8 or any
other Unicode encoding.
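Cliff's point can be checked directly at the byte level (a quick sketch):

```python
# ASCII text encodes to identical bytes in ASCII and UTF-8 ...
assert u'cafe'.encode('ascii') == u'cafe'.encode('utf-8')

# ... but ISO-8859-1 and UTF-8 produce different byte streams for the
# same non-ASCII text.
assert u'caf\u00e9'.encode('iso-8859-1') == b'caf\xe9'
assert u'caf\u00e9'.encode('utf-8') == b'caf\xc3\xa9'

# So ISO-8859-1 bytes are generally not valid UTF-8.
try:
    b'caf\xe9'.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert valid_utf8 is False
```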
Dec 9 '07 #5
