Bytes IT Community

Best ways of managing text encodings in source/regexes?

Hi

I've read around quite a bit about Unicode and python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)

* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because their source contains no non-ASCII chars.

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?
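Concretely, the setup I have in mind looks something like this (the file name and contents are invented purely for the test):

```python
import codecs
import os
import re
import tempfile

# Write a small UTF-8 test file containing an en-dash; the contents
# are made up for illustration.
tmp = tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False)
tmp.write(u'PAGES 3\u20135'.encode('utf-8'))
tmp.close()

# Read it back as a unicode string via codecs, then match both patterns.
f = codecs.open(tmp.name, 'r', 'utf-8')
text = f.read()
f.close()
os.remove(tmp.name)

myASCIIRegex = re.compile(u'[A-Z]+')
myUniRegex = re.compile(u'\u2013')   # en-dash

assert myASCIIRegex.search(text).group() == u'PAGES'
assert myUniRegex.search(text) is not None
```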

I've been trying to understand this for a while so any clarification
would be a great help.

Tim
Nov 26 '07 #1
4 Replies


* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)
Depends on what you want to achieve. If you don't prefix your strings
with u, they stay byte string objects and don't become Unicode
strings. That is fine for strings that are pure ASCII; for strings
containing ISO-8859-1 chars, it is safer to use only Unicode objects
to represent them.

In Py3k, that will change - string literals will automatically be
Unicode objects.
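To illustrate the distinction, a minimal sketch (the u prefix shown here is also legal again in Python 3.3+, where both literals are text):

```python
# -*- coding: utf-8 -*-
s = "blah"           # a byte string in Python 2 (plain text in Python 3)
u = u"blah\u00e9"    # a Unicode string in both

assert u"blah" == "blah"                      # pure-ASCII literals compare equal
assert u.encode("utf-8") == b"blah\xc3\xa9"   # UTF-8 uses two bytes for U+00E9
assert u.encode("iso-8859-1") == b"blah\xe9"  # ISO-8859-1 uses one
```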
* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)?
Yes, it will. The encoding declaration is per-module.
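One way to see the per-source-unit behaviour without creating two files is compile(), which honours the coding cookie in a bytes source (a sketch, not how imports are normally exercised):

```python
# A bytes source carrying its own coding cookie: compile() honours the
# declaration per source unit, just as the import machinery does.
src = b"# -*- coding: iso-8859-1 -*-\ns = u'caf\xe9'\n"
ns = {}
exec(compile(src, '<module_a>', 'exec'), ns)
assert ns['s'] == u'caf\u00e9'   # \xe9 was decoded as ISO-8859-1
```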
* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to do make all my regex patterns unicode strings, with u""?
It will work fine if the regular expression restricts itself to ASCII
and doesn't rely on any of the locale-specific character classes (such
as \w). If it goes beyond ASCII, or does use such escapes, you had
better make it a Unicode expression.

I'm not actually sure what precisely the semantics are when you match
an expression compiled from a byte string against a Unicode string,
or vice versa. I believe it operates on the internal representation,
so \xf6 in a byte string expression matches \u00f6 in a Unicode
string; it won't try to convert one into the other.
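For what it's worth, Python 3 later sidestepped the question entirely: a pattern compiled from bytes refuses a text subject outright, so the coercion never arises there. A sketch of that behaviour:

```python
import re

bpat = re.compile(b'\xf6')      # pattern compiled from a byte string
upat = re.compile(u'\u00f6')    # pattern compiled from a Unicode string

assert upat.search(u'f\u00f6n') is not None

# In Python 3, a bytes pattern applied to a str subject raises TypeError.
try:
    bpat.search(u'f\u00f6n')
    mixed = True
except TypeError:
    mixed = False
assert mixed is False
```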

Regards,
Martin
Nov 26 '07 #2

Thanks Martin, that's a very helpful response to what I was concerned
might be an overly long query.

Yes, I'd read that in Py3k the distinction between byte strings and
Unicode strings would disappear -- I look forward to that...

Tim
Nov 26 '07 #3

On Nov 26, 2007 4:27 PM, "Martin v. Löwis" <ma****@v.loewis.de> wrote:

It will work fine if the regular expression restricts itself to ASCII,
and doesn't rely on any of the locale-specific character classes (such
as \w). If it's beyond ASCII, or does use such escapes, you better make
it a Unicode expression.
Yes, you have to be careful when writing Unicode-sensitive regular expressions:
http://effbot.org/zone/unicode-objects.htm

"You can apply the same pattern to either 8-bit (encoded) or Unicode
strings. To create a regular expression pattern that uses Unicode
character classes for \w (and \s, and \b), use the "(?u)" flag prefix,
or the re.UNICODE flag:

pattern = re.compile("(?u)pattern")
pattern = re.compile("pattern", re.UNICODE)"
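A small demonstration of the difference; note that in Python 3 the Unicode behaviour became the default for text patterns, with re.ASCII restoring the old one:

```python
import re

text = u'caf\u00e9 2\u20133'

# In Python 3, \w on a str pattern is Unicode-aware by default ...
assert re.findall(u'\\w+', text) == [u'caf\u00e9', u'2', u'3']

# ... and re.ASCII (the opposite of Python 2's re.UNICODE) limits \w
# to [a-zA-Z0-9_], so the accented char is excluded.
assert re.findall(u'\\w+', text, re.ASCII) == [u'caf', u'2', u'3']
```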

Nov 27 '07 #4

tvn
Please see the correction from Cliff, pasted here after this excerpt
from my earlier message:
Tim

> the byte string is ASCII, which is a subset of Unicode (ISO-8859-1 isn't).

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets
of Unicode (which relates to the abstract code points), but ASCII is
also a subset of UTF-8 on the bytestream level, while ISO-8859-1 is
not a subset of UTF-8, nor, as far as I can tell, of any other Unicode
*encoding*.

Thus a file encoded in ASCII *is* in fact a UTF-8 file; there is no
way to distinguish the two. But an ISO-8859-1 file is not the same (on
the bytestream level) as a file with identical content in UTF-8 or any
other Unicode encoding.
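Cliff's point can be checked directly at the byte level (a quick sketch):

```python
# ASCII text encodes to identical bytes in ASCII and UTF-8 ...
assert u'cafe'.encode('ascii') == u'cafe'.encode('utf-8')

# ... but ISO-8859-1 and UTF-8 produce different byte streams for the
# same non-ASCII text.
assert u'caf\u00e9'.encode('iso-8859-1') == b'caf\xe9'
assert u'caf\u00e9'.encode('utf-8') == b'caf\xc3\xa9'

# So ISO-8859-1 bytes are generally not valid UTF-8.
try:
    b'caf\xe9'.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert valid_utf8 is False
```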
Dec 9 '07 #5
