unicode "em space" in regex

how to represent the unicode "em space" in regex?

e.g. I want to do something like this:

fracture=re.split(r'\342371*\|\342371*',myline,re.U)

Xah
xa*@xahlee.org
∑ http://xahlee.org/

Jul 19 '05 #1
Xah Lee:
> how to represent the unicode "em space" in regex?
>
> e.g. I want to do something like this:
>
> fracture=re.split(r'\342371*\|\342371*',myline,re.U)


I'm not sure what you're trying to do, but would it help you to use
its name:
EM_SPACE = u'\N{EM SPACE}'
fracture = myline.split(EM_SPACE)


?

Cheers,

--
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/
Jul 19 '05 #2
Xah Lee wrote:
> how to represent the unicode "em space" in regex?


You will have to pass a Unicode literal as the regular expression,
e.g.

fracture=re.split(u'\u2003*\\|\u2003*',myline,re.U)

Notice that, in raw Unicode literals, you can still use \u to
escape characters, e.g.

fracture=re.split(ur'\u2003*\|\u2003*',myline,re.U)

Regards,
Martin
Jul 19 '05 #3
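
Editor's note for readers on Python 3: the snippets above are Python 2 code, and the `ur''` prefix no longer exists. The following is a minimal sketch of the same split, assuming Python 3.3 or later, where `str` patterns are Unicode by default and the `re` module itself understands `\u2003` inside a raw string. The names `FIELD_SEP`, `split_fields`, and `sample` are made up for illustration. Note also that the third positional argument of `re.split` is `maxsplit`, not `flags`, so flags are better passed by keyword (none are needed here).

```python
import re

# Optional em spaces (U+2003) on either side of a literal pipe.
# In a raw string the \u2003 reaches the re module untouched, and
# re interprets the escape itself (supported since Python 3.3).
FIELD_SEP = re.compile(r"\u2003*\|\u2003*")


def split_fields(line):
    """Split a line on '|' and swallow em spaces that touch the pipe."""
    return FIELD_SEP.split(line)


# Hypothetical sample line: one pipe padded with em spaces, one bare pipe.
sample = "alpha\u2003|\u2003beta|gamma"
fields = split_fields(sample)
print(fields)
```

If flags are ever needed, `re.split(pattern, line, flags=re.UNICODE)` keeps them from being read as a split limit.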
Thanks. Is it true that any unicode chars can also be used inside regex
literally?

e.g.
re.search(ur' +',mystring,re.U)

I tested this case and apparently i can. But is it true that any
unicode char can be embedded in regex literally? (does this apply to
the esoteric ones such as other non-printing chars and combining
forms...)

----
Related...:

The official python doc:
http://python.org/doc/2.4.1/lib/module-re.html
says:

"Regular expression pattern strings may not contain null bytes, but can
specify the null byte using the \number notation."

What is meant by null bytes here? Unprintable chars?? and the "\number"
is meant to be decimal? and in what encoding?

Xah
xa*@xahlee.org
∑ http://xahlee.org/

Jul 19 '05 #4
Xah Lee wrote:
> "Regular expression pattern strings may not contain null bytes, but can
> specify the null byte using the \number notation."
>
> What is meant by null bytes here? Unprintable chars?? and the "\number"
> is meant to be decimal? and in what encoding?


The null byte is a byte with the integer value 0. Difficult, isn't it.

The \number notation is, as you could read in http://docs.python.org/ref/strings.html,
octal.

Reinhold
Jul 19 '05 #5
Xah Lee wrote:
> "Regular expression pattern strings may not contain null bytes, but can
> specify the null byte using the \number notation."
>
> What is meant by null bytes here? Unprintable chars??

no, null bytes. "\0". "\x00". ord(byte) == 0. chr(0).

> and the "\number" is meant to be decimal?

octal. this is explained on the "Regular Expression Syntax" page.

> and in what encoding?

null byte encoding? you're confused.

</F>

Jul 19 '05 #6
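
Editor's note: the two answers above can be checked in a few lines. This is a small sketch, assuming Python 3; the variable names are made up for illustration. It shows that the null byte is simply the character with ordinal 0, and that `\number` in a pattern is read as octal when it starts with 0 or is three octal digits long.

```python
import re

# The null byte is the character with ordinal 0, however it is spelled.
NUL = "\0"
same_spellings = NUL == "\x00" == chr(0)
nul_ordinal = ord(NUL)

# Raw string: the backslash survives, so the re module sees the
# octal escape \000 and matches the null character in the subject.
nul_pattern = re.compile(r"\000")
nul_position = nul_pattern.search("ab\x00cd").start()

# Octal, not decimal: \101 is 1*64 + 0*8 + 1 = 65, which is "A".
octal_a = re.compile(r"\101").match("A")

print(same_spellings, nul_ordinal, nul_position, octal_a is not None)
```

No encoding is involved at this level: the pattern compares ordinal values, and 0 is 0.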
Xah Lee wrote:
> Thanks. Is it true that any unicode chars can also be used inside regex
> literally?
>
> e.g.
> re.search(ur' +',mystring,re.U)
>
> I tested this case and apparently i can.

Yes. In fact, whether you write u"\u2003" or u" " doesn't matter
to re.search. Either way you get a Unicode object with U+2003
in it, which is processed by SRE.

> But is it true that any
> unicode char can be embedded in regex literally? (does this apply to
> the esoteric ones such as other non-printing chars and combining
> forms...)

Yes. To SRE, only the Unicode ordinal values matter. To determine
whether something matches, it needs to have the same ordinal value
in the string that you have in the expression. No interpretation
of the character is performed, except for the few characters that
have markup meaning in regular expressions (e.g. $, \, [, etc.).

Regards,
Martin
Jul 19 '05 #7
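
Editor's note: a short sketch of what "only the ordinal values matter" means in practice, assuming Python 3 (where every `str` is Unicode). All variable names are illustrative. The first part shows that an escaped and a literal em space produce the same pattern; the second shows that a combining sequence is matched code point by code point, with no normalization done by the regex engine.

```python
import re
import unicodedata

# Two spellings of the same pattern: one or more em spaces (U+2003).
escaped = "\u2003+"
by_name = "\N{EM SPACE}+"
same_pattern = escaped == by_name

run = re.search(escaped, "a\u2003\u2003b")
run_length = len(run.group())

# Combining forms: "e" followed by U+0301 (combining acute accent)
# versus the precomposed U+00E9. They look alike but differ in ordinals.
decomposed = "e\u0301"
precomposed = "\u00e9"
hits_decomposed = re.search(decomposed, "caf" + decomposed) is not None
hits_precomposed = re.search(decomposed, "caf" + precomposed) is not None

# Normalizing both sides to the same form first makes them comparable.
nfc_pattern = unicodedata.normalize("NFC", decomposed)
hits_after_nfc = re.search(nfc_pattern, "caf" + precomposed) is not None

print(same_pattern, run_length, hits_decomposed, hits_precomposed, hits_after_nfc)
```

So any character can be embedded literally, but text that may mix composed and decomposed forms should be normalized (for example with `unicodedata.normalize`) before matching.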
