Regular expression: non-latin word/non-word characters and UTF-8

Hi

I wrote a function that "normalizes" strings for use in URLs in a UTF-8
encoded content administration application. After removing the accents
from Latin characters, I try to replace every run of non-word characters in
the string with a hyphen:

// PCRE syntax:
$string = preg_replace("/([\W]+)/", "-", $string);

// POSIX alternative (mbstring is on):
$string = ereg_replace("[^[:alnum:]]+", "-", $string);

// post-process and return
return urlencode(trim($string, "-"));

Both ways work, but they also remove all non-Latin characters. What I want is
to remove only the non-word characters, whatever the language, and keep all
word characters, whether they are Japanese, Hebrew, Arabic, Latin or anything
else.

Is there a way for a regex to recognize non-Latin word/non-word characters?
Or do I have to manually specify all the characters to be removed?

Thanks for every hint
Markus
Sep 22 '05 #1
1 Reply
I suppose your *other* post didn't supply you with an agreeable answer, and
you think that *re-posting* the same question will. Well, guess what? I doubt
re-posting it will find you a more agreeable donor.

...but that's just me thinking out loud.
"Markus Ernst" <derernst@NO#SP#AMgmx.ch> wrote in message
news:43**********@news.cybercity.ch...
| Hi
|
| I wrote a function that "normalizes" strings for use in URLs in a UTF-8
| encoded content administration application. After having removed the
| accents from latin characters I try to remove all non-word characters
| from the string:
|
| // PCRE syntax:
| $string = preg_replace("/([\W]+)/", "-", $string);
|
| // POSIX alternative (mb_string is on):
| $string = ereg_replace("[^[:alnum:]]+", "-", $string);
|
| // post-process and return
| return urlencode(trim($string, "-"));
|
| Both ways work but remove all non-latin characters. But what I want to do
| is remove only the non-word characters of whatever languages, and keep all
| word characters regardless if they are Japanese, Hebrew, Arab, Latin or
| whatever.
|
| Is there a way for a Regex to recognize non-latin word/non-word
| characters? Or do I have to manually specify all the characters to be
| removed?
|
| Thanks for every hint
| Markus
Sep 22 '05 #2
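
On the question itself: one possible approach is PCRE's Unicode property
classes together with the /u modifier, so that "letter" and "number" are
matched in any script instead of ASCII only. The following is only a minimal
sketch, assuming a PHP build whose PCRE has UTF-8 and Unicode property
support and an input string that is valid UTF-8. The function name
slugify_unicode is hypothetical and not from the original post.

```php
<?php
// Hypothetical helper, not the original poster's function.
function slugify_unicode(string $string): string
{
    // \p{L} = letters, \p{N} = numbers, \p{M} = combining marks, in any script.
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('/[^\p{L}\p{N}\p{M}]+/u', '-', $string);

    // preg_replace() returns null on failure, e.g. when the input
    // is not valid UTF-8 (an assumption stated above).
    if ($result === null) {
        return '';
    }

    // post-process and return, as in the original snippet
    return urlencode(trim($result, '-'));
}

echo slugify_unicode('Hello, world!'), "\n"; // prints "Hello-world"
```

Non-Latin word characters are kept by the pattern and then percent-encoded by
urlencode(), so urldecode(slugify_unicode('日本語 テスト!')) gives back
'日本語-テスト'. Whether \p{M} should be kept depends on whether the accents
were already stripped in an earlier step.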
