unicode "em space" in regex

P: n/a
how to represent the unicode "em space" in regex?

e.g. i want to do something like this:

fracture=re.split(r'\342371*\|\342371*',myline,re.U)

Xah
xa*@xahlee.org
http://xahlee.org/

Jul 19 '05 #1
6 Replies


P: n/a
Xah Lee wrote:
> how to represent the unicode "em space" in regex?
>
> e.g. i want to do something like this:
>
> fracture=re.split(r'\342371*\|\342371*',myline,re.U)


I'm not sure what you're trying to do, but would it help to use
its name:
EM_SPACE = u'\N{EM SPACE}'
fracture = myline.split(EM_SPACE)


?

Cheers,

--
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/
Jul 19 '05 #2
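
The named-character approach from the reply above can be sketched in Python 3 as follows. This is a minimal illustration, not the poster's exact code: the sample value of `myline` is made up, and in Python 3 the `u` prefix is no longer needed.

```python
# Python 3 sketch: split on the em space by name, no regex needed.
EM_SPACE = "\N{EM SPACE}"  # the same character as "\u2003"

# Hypothetical sample line: three fields separated by single em spaces.
myline = "alpha" + EM_SPACE + "beta" + EM_SPACE + "gamma"

fracture = myline.split(EM_SPACE)
print(fracture)
```

A plain `str.split` is enough when the separator is a fixed string; a regex only becomes necessary when the separator varies, such as a `|` with optional em spaces around it.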

P: n/a
Xah Lee wrote:
> how to represent the unicode "em space" in regex?


You will have to pass a Unicode literal as the regular expression,
e.g.

fracture=re.split(u'\u2003*\\|\u2003*',myline,flags=re.U)

Notice that, in raw Unicode literals, you can still use \u to
escape characters, e.g.

fracture=re.split(ur'\u2003*\|\u2003*',myline,flags=re.U)

Regards,
Martin
Jul 19 '05 #3
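
For readers on Python 3, the same split might look like the sketch below. It assumes Python 3.3 or later, where the `re` module itself understands `\uXXXX` escapes inside a pattern, so an ordinary raw string works in place of the Python 2 `ur'...'` literal. The sample value of `myline` is hypothetical.

```python
import re

# Hypothetical sample: fields separated by "|" with optional em spaces around it.
myline = "alpha\u2003|\u2003beta|gamma\u2003\u2003|delta"

# Raw pattern: the re module resolves \u2003 itself (Python 3.3+).
pattern = re.compile(r"\u2003*\|\u2003*")
fracture = pattern.split(myline)

# Non-raw equivalent: the string literal resolves \u2003, and the
# backslash in front of "|" has to be doubled.
fracture_plain = re.split("\u2003*\\|\u2003*", myline)

# flags is passed by keyword because the third positional
# parameter of re.split is maxsplit, not flags.
fracture_flags = re.split(r"\u2003*\|\u2003*", myline, flags=re.UNICODE)

print(fracture)
```

All three calls split the line the same way. The flag makes no difference for this particular pattern, since it only changes the meaning of classes such as `\w` and `\s`, and it is already the default for `str` patterns in Python 3.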

P: n/a
Thanks. Is it true that any unicode char can also be used inside a regex
literally?

e.g.
re.search(ur' +',mystring,re.U)

I tested this case and apparently i can. But is it true that any
unicode char can be embedded in a regex literally? (does this apply to
the esoteric ones such as other non-printing chars and combining
forms...)

----
Related...:

The official python doc:
http://python.org/doc/2.4.1/lib/module-re.html
says:

"Regular expression pattern strings may not contain null bytes, but can
specify the null byte using the \number notation."

What is meant by null bytes here? Unprintable chars?? and the "\number"
is meant to be decimal? and in what encoding?

Xah
xa*@xahlee.org
http://xahlee.org/

Jul 19 '05 #4

P: n/a
Xah Lee wrote:
> "Regular expression pattern strings may not contain null bytes, but can
> specify the null byte using the \number notation."
>
> What is meant by null bytes here? Unprintable chars?? and the "\number"
> is meant to be decimal? and in what encoding?


The null byte is a byte with the integer value 0. Difficult, isn't it.

The \number notation is, as you could read in http://docs.python.org/ref/strings.html,
octal.

Reinhold
Jul 19 '05 #5

P: n/a
Xah Lee wrote:
> "Regular expression pattern strings may not contain null bytes, but can
> specify the null byte using the \number notation."
>
> What is meant by null bytes here? Unprintable chars??

no, null bytes. "\0". "\x00". ord(byte) == 0. chr(0).

> and the "\number" is meant to be decimal?

octal. this is explained on the "Regular Expression Syntax" page.

> and in what encoding?

null byte encoding? you're confused.

</F>

Jul 19 '05 #6
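
The two replies about null bytes and octal escapes can be checked with a short Python 3 sketch. The sample strings are made up for illustration; the escape behaviour shown is the documented `\number` rule, where a leading 0 or three octal digits means "character with this octal value" rather than a group reference.

```python
import re

# Four spellings of the same one-character string: the null character.
nul = chr(0)
same = ("\0" == nul) and ("\x00" == nul) and ("\000" == nul)

# Inside a pattern, \0 is an octal escape that matches the null character.
hit = re.search(r"\0", "ab\x00cd")
nul_index = hit.start()

# Three octal digits name any character: octal 101 is decimal 65, i.e. "A".
octal_a = re.search(r"\101", "xyzA")

print(same, nul_index, octal_a.group())
```

Encoding does not enter into it: the escape names a numeric value, and the null character has value 0 whichever way it is written.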

P: n/a
Xah Lee wrote:
> Thanks. Is it true that any unicode chars can also be used inside regex
> literally?
>
> e.g.
> re.search(ur' +',mystring,re.U)
>
> I tested this case and apparently i can.

Yes. In fact, whether you write u"\u2003" or u" " doesn't matter
to re.search. Either way you get a Unicode object with U+2003
in it, which is processed by SRE.

> But is it true that any
> unicode char can be embedded in regex literally. (does this apply to
> the esoteric ones such as other non-printing chars and combining
> forms...)

Yes. To SRE, only the Unicode ordinal values matter. For something to
match, the string must contain the same ordinal values that you have
in the expression. No interpretation of the character is performed,
except for the few characters that have special meaning in regular
expressions (e.g. $, \, [).

Regards,
Martin
Jul 19 '05 #7
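
The "only ordinal values matter" point can be illustrated with a small Python 3 sketch. The sample text and the use of `unicodedata.normalize` are additions for illustration and are not part of the reply above.

```python
import re
import unicodedata

# However the em space is spelled, the result is the same one-character string.
EM_SPACE = "\N{EM SPACE}"
same_char = (EM_SPACE == "\u2003") and (EM_SPACE == chr(0x2003))

# A combining mark is matched like any other code point.
text = "cafe\u0301"                      # "e" followed by COMBINING ACUTE ACCENT
mark = re.search("\u0301", text)

# No interpretation is performed: the precomposed U+00E9 is a different
# ordinal, so it does not match the decomposed sequence.
precomposed = re.search("\u00e9", text)  # None

# Normalizing first (an extra step, not done by re) makes the forms agree.
normalized = unicodedata.normalize("NFC", text)
precomposed_after = re.search("\u00e9", normalized)

# Characters with special meaning in a pattern still need escaping.
dollar = re.search(re.escape("$"), "cost: $5")

print(same_char, mark.start(), precomposed, precomposed_after.start(), dollar.start())
```

If input may mix composed and decomposed forms, normalizing both the pattern and the text to the same form before matching is one way to get consistent results.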
