
Regular expression for matching IPA characters in Unicode?

Hi Pythoneers,

Which is the best way of checking that a given unicode string only
contains IPA characters, e.g. characters in the range \u0250-\u02AF?
I guess a regular expression would do it, just can't figure out how to
implement that expression.

Code snippets are most welcome.

Best regards,

Mickel Grönroos

--
Mickel Grönroos, application specialist, linguistics, CSC
PL 405 (Tekniikantie 15 a D), 02101 Espoo, Finland,
CSC is the Finnish IT center for science, www.csc.fi
Jul 18 '05 #1
2 Replies


Mickel Grönroos wrote:
Which is the best way of checking that a given unicode string only
contains IPA characters, e.g. characters in the range \u0250-\u02AF?


Well, I'll give you an example that only includes characters in the
range [\u0250, \u02AF], but those are just the IPA *extensions*. You also
need to include Basic Latin and Greek characters from other blocks.

See: http://www.unicode.org/charts/PDF/U0250.pdf

And why do you want to do this anyway?

This example uses all(), a recipe from the itertools documentation, which
tells you whether a predicate is true for every item in an iterable. The
predicate here is whether the item is contained in IPA_CHARS, which you
can expand...

=====

import itertools
from sets import Set # set() is a built-in in 2.4

IPA_CHARS = Set(map(unichr, xrange(0x250, 0x2b0)))

def all(seq, pred=bool):
    "Returns True if pred(x) is True for every element in the iterable"
    # http://www.python.org/doc/current/li...s-example.html
    return False not in itertools.imap(pred, seq)

def is_ipa(iterable):
    return all(iterable, IPA_CHARS.__contains__)

print is_ipa(u"aeiou") # this is valid IPA, but not in the extensions block
print is_ipa(u"\u0260\u02af") # valid IPA in the extensions block

====output===

False
True
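
For readers on Python 3, the same set-based check can be sketched as follows. This is only an illustration of how IPA_CHARS might be expanded beyond the extensions block: the names IPA_EXTENSIONS and EXTRA_CHARS are made up here, and the extra characters (lowercase ASCII letters, three Greek letters, and three stress/length marks) are an assumed, incomplete sample rather than a full IPA inventory.

```python
# Python 3 sketch: all() is a built-in, and chr()/range() replace unichr()/xrange().

# The IPA Extensions block, U+0250 through U+02AF inclusive.
IPA_EXTENSIONS = {chr(cp) for cp in range(0x250, 0x2B0)}

# ASSUMPTION: an illustrative, incomplete sample of IPA symbols that live
# in other blocks. Extend this to suit your own data.
EXTRA_CHARS = (
    set("abcdefghijklmnopqrstuvwxyz")    # Basic Latin letters
    | set("\u03b2\u03b8\u03c7")          # Greek beta, theta, chi
    | set("\u02c8\u02cc\u02d0")          # primary stress, secondary stress, length mark
)

IPA_CHARS = IPA_EXTENSIONS | EXTRA_CHARS


def is_ipa(text):
    """Return True if every character of text is in IPA_CHARS."""
    return all(ch in IPA_CHARS for ch in text)


print(is_ipa("aeiou"))          # True, now that Basic Latin is included
print(is_ipa("\u0260\u02af"))   # True, both are in the extensions block
print(is_ipa("Hello"))          # False, uppercase "H" is not in the set
```

Note that is_ipa("") returns True, because all() of an empty iterable is True. Add an explicit length check if empty strings should be rejected.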
--
Michael Hoffman
Jul 18 '05 #2

P: n/a
Mickel Grönroos wrote:
Which is the best way of checking that a given unicode string only
contains IPA characters, e.g. characters in the range \u0250-\u02AF?


The regular expression for that is [\u0250-\u02AF]. You can either make
the regular expression a Unicode string itself, or you can make it a
normal (byte) string, and put the backslash-u-number sequence into it
(e.g. with double-backslash quotation).
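
A character class on its own matches a single character, so to test a whole string the class needs a repetition and a full-string match. A minimal sketch, assuming Python 3.4 or later (where re.fullmatch exists and the re module itself understands \uXXXX escapes in str patterns); the function name only_ipa_extensions is made up for this example:

```python
import re

# Two equivalent spellings of the same pattern:
# 1) a normal string, where Python turns \u0250 into the actual character
PATTERN_LITERAL = re.compile("[\u0250-\u02AF]+")
# 2) a raw string, where the backslash-u sequences are passed through to
#    the re module, which interprets them itself
PATTERN_RAW = re.compile(r"[\u0250-\u02AF]+")


def only_ipa_extensions(text, pattern=PATTERN_RAW):
    """Return True if text is non-empty and consists solely of
    characters in the range U+0250 to U+02AF."""
    return pattern.fullmatch(text) is not None


print(only_ipa_extensions("\u0260\u02af"))   # True
print(only_ipa_extensions("aeiou"))          # False, outside the range
print(only_ipa_extensions("\u0260a"))        # False, one character is outside
```

To accept characters from other blocks as well, further ranges or single characters can be added inside the brackets. Using "*" instead of "+" would let the empty string pass.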

Regards,
Martin
Jul 18 '05 #3
