Unicode strings and ascii regular expressions

Hello all,

Can someone confirm that compiled regular expressions from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings ?

I've tested it and it works - but can someone confirm that this is
consistent and safe ? (No lurking encode errors - I assume it is only a
decode that is done, in which case is it safe on a system that has a
non-ascii compatible default encoding ? OTOH it would seem to me that
that would break *everything*.)

>>> import re
>>> r = re.compile('(.*)=(.*)')
>>> s = '£££=£££'.decode('cp1252')  # yields a unicode string that can't be encoded as ascii
>>> c = r.match(s)
>>> c.groups()  # yields two unicode strings
(u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
>>> print c.groups()[0].encode('cp1252')  # which encode safely
£££

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Jan 30 '06 #1
Fuzzyman wrote:
Can someone confirm that compiled regular expressions from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings ?


ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.

</F>

Jan 30 '06 #2
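
The point in the reply above (groups are slices of the target, so you get back the type you pass in) can be checked with a small sketch. This is only an illustration, not part of the original thread: it assumes Python 3, where `str` is already Unicode text and the `.decode()` step applies to `bytes`, and the variable names are made up for the example. The last part shows a Python 3 difference from the Python 2 session in the question: a `bytes` pattern is rejected on a `str` target rather than being matched.

```python
import re

# Pattern written with ASCII characters only, as in the original post.
pattern = re.compile('(.*)=(.*)')

# Build the target the way the post does: decode cp1252 bytes into text.
# b'\xa3' is the pound sign in cp1252.
target = b'\xa3\xa3\xa3=\xa3\xa3\xa3'.decode('cp1252')

match = pattern.match(target)
left, right = match.groups()

# The groups are slices of the target, so they are text (str), not bytes.
print(type(left).__name__, ascii(left))

# Encoding back to cp1252 round-trips without error.
print(left.encode('cp1252'))

# Python 3 only (assumption stated above): pattern and target must be the
# same kind of string, so a bytes pattern on a str target raises TypeError.
mixed_error = None
try:
    re.compile(b'(.*)=(.*)').match(target)
except TypeError as exc:
    mixed_error = exc
    print('bytes pattern on str target:', exc)
```

No encode or decode happens during the match itself in this sketch; the only conversions are the explicit `.decode('cp1252')` and `.encode('cp1252')` calls.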

Fredrik Lundh wrote:
Fuzzyman wrote:
Can someone confirm that compiled regular expressions from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings ?
[snip..]
ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.

Thanks - that's what I hoped. :-)

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Jan 31 '06 #3
