Sorry for the multipost - I forgot to crosspost and alt.php gets less
attention than comp.lang.php... And I hope this will work with UTF-8.
In order to make strings suitable for URLs in a UTF-8 encoded website, I use
two functions. The first removes accents from some Latin-1, Latin-2, and
Turkish characters (suggestions for changes or additions welcome!); the
second replaces each run of non-word characters with a hyphen and then
urlencode()s the string:
function remove_accents($string, $german = false) {
    // Single characters: each entry in $single_fr maps to the entry at the
    // same position in $single_to. The lists are concatenated from short
    // literals so that no string wraps across a line break.
    $single_fr = explode(" ",
        "À Á Â Ã Ä Å Ă Ą Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î Ï İ Ĺ Ľ Ł Ñ Ń Ň " .
        "Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Ś Š Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ź Ž Ż " .
        "à á â ã ä å ă ą ç ć č ď đ ð è é ê ë ę ě ğ ì í î ï ı ĺ ľ ł ñ ń ň " .
        "ò ó ô õ ö ø ő ŕ ř ś š ş ť ţ ù ú û ü ů ű ý ÿ ź ž ż");
    $single_to = explode(" ",
        "A A A A A A A A C C C D D D E E E E E E G I I I I I L L L N N N " .
        "O O O O O O O R R S S S T T U U U U U U Y Z Z Z " .
        "a a a a a a a a c c c d d d e e e e e e g i i i i i l l l n n n " .
        "o o o o o o o r r s s s t t u u u u u u y y z z z");
    $single = array_combine($single_fr, $single_to);
    // Ligatures
    $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe",
        "ß"=>"ss");
    // German umlauts (these override the single-character entries)
    $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
        "ü"=>"ue");
    // Replace: with an array argument, strtr() substitutes whole keys, so
    // the multi-byte UTF-8 sequences are matched as units
    $replacements = array_merge($single, $ligatures);
    if ($german) $replacements = array_merge($replacements, $umlauts);
    return strtr($string, $replacements);
}
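function make_url_string($string) {
    // Transliterate first (with German umlaut expansion), then lower-case
    $string = strtolower(remove_accents($string, true));
    // Replace each run of non-word characters with a single hyphen; without
    // the /u modifier this also hits every byte of a leftover non-ASCII
    // character
    $string = preg_replace("/([\W]+)/", "-", $string);
    // Drop leading and trailing hyphens before encoding
    return urlencode(trim($string, "-"));
}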
I have two questions about this:
1. preg_replace("/([\W]+)/", "-", $string); also replaces all non-ASCII
characters. Is there a way to strip only punctuation and similar
characters, but keep letters from any character set?
2. Is there a better way to encode strings for URLs? Or is it unavoidable
to store the real name and the name for the URL separately in order to get
an ASCII-only entry?
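
To make question 1 concrete, here is the direction I have been looking at: a
minimal sketch, assuming PCRE is built with Unicode support (the /u modifier
and the \p{L} / \p{N} property classes) and that the input is valid UTF-8.
The name make_unicode_url_string() is made up for illustration, and the
mbstring call is only used if the extension happens to be loaded. I have not
tested this beyond simple strings.

```php
<?php
// Sketch only: make_unicode_url_string() is a hypothetical name, and the
// input is assumed to be valid UTF-8.
function make_unicode_url_string($string) {
    // Lower-case with mbstring if it is loaded; otherwise fall back to
    // strtolower(), which only reliably handles ASCII letters.
    $string = function_exists('mb_strtolower')
        ? mb_strtolower($string, 'UTF-8')
        : strtolower($string);
    // \p{L} = any letter, \p{N} = any number; /u makes PCRE read UTF-8
    // characters instead of single bytes.
    $string = preg_replace('/[^\p{L}\p{N}]+/u', '-', $string);
    if ($string === null) {
        return '';  // preg_replace() returns null on invalid UTF-8 input
    }
    // rawurlencode() percent-encodes the remaining non-ASCII bytes.
    return rawurlencode(trim($string, '-'));
}

// Source file saved as UTF-8:
echo make_unicode_url_string("Crème brûlée & Co.!"), "\n";
// prints cr%C3%A8me-br%C3%BBl%C3%A9e-co
```

This keeps the accented letters, but the result is percent-encoded UTF-8
rather than plain ASCII, so it does not settle question 2.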
Thanks for suggestions!
--
Markus