
Unicode strings

Andrew L
#1: Jul 22 '05
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew

Bob Hairgrove
#2: Jul 22 '05

re: Unicode strings


On Tue, 22 Jun 2004 13:12:33 +0100, Andrew L
<andrew_on_tour@operamail.com> wrote:
>Hello all,
>
>What strategy should I use in solving the following problem? I have a list
>of unicode strings which I would like to compare with its English language
>'equivalent.' eg
>
>"reykjavík" (note the accent above the i) should match both "reykjavík" and
>"reykjavik" (being the English equivalent).
>
>Similarly the German language letter 'ß' should match "ss", umlauted a's,
>o's etc should match a,o etc.
>
>How would I go about doing this using the c++ stdlib?
>
>Many thanks,
>
>Andrew

You have to implement some kind of lookup table or dictionary.
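
A minimal sketch of what such a lookup table could look like, assuming the strings have already been decoded to UTF-32 (std::u32string). The names fold_table, fold and loosely_equal are made up for illustration, and the table covers only a handful of characters; a real one would need many more entries:

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative table only: maps a few accented letters to plain
// replacements. A real table would need far more entries.
const std::map<char32_t, std::u32string>& fold_table()
{
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00E1', U"a"},   // a with acute
        {U'\u00ED', U"i"},   // i with acute
        {U'\u00E4', U"a"},   // a with umlaut
        {U'\u00F6', U"o"},   // o with umlaut
        {U'\u00FC', U"u"},   // u with umlaut
        {U'\u00DF', U"ss"},  // German sharp s
    };
    return table;
}

// Replaces every character found in the table; all others pass through.
std::u32string fold(const std::u32string& text)
{
    std::u32string out;
    for (char32_t c : text) {
        auto it = fold_table().find(c);
        if (it != fold_table().end()) out += it->second;
        else out += c;
    }
    return out;
}

// Two strings match if they are identical after folding.
bool loosely_equal(const std::u32string& a, const std::u32string& b)
{
    return fold(a) == fold(b);
}

// Usage sketch with the examples from this thread.
void usage_example()
{
    assert(loosely_equal(U"reykjav\u00EDk", U"reykjavik"));
    assert(fold(U"Stra\u00DFe") == U"Strasse");
}
```

Folding both sides before comparing means the accented and the plain spelling both match, which is what the original question asks for.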

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.

Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are
spelled exactly the same. "Band" in German might be a different word
than "band" in English, for example.

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).


--
Bob Hairgrove
NoSpamPlease@Home.com
Karl Heinz Buchegger
#3: Jul 22 '05

re: Unicode strings


Bob Hairgrove wrote:
[snip]
> Even if you get rid of the special characters, you must really watch
> out (e.g. German "Präservative" and English "preservative" <g>).

One of the most puzzling words for German-speaking students of English
is the word: eventually. There is a very similar word in German: eventuell,
which means: maybe. But eventually means: finally

--
Karl Heinz Buchegger
kbuchegg@gascad.at
Andrew L
#4: Jul 22 '05

re: Unicode strings


Bob Hairgrove wrote:
> You have to implement some kind of lookup table or dictionary.
>
> Although STL supports locales, I don't think there is a way of
> comparing two strings in *different* locales ... especially for
> Unicode strings, since there is no locale for Unicode -- Unicode
> covers *all* locales.

This is what I suspected. Many thanks for that. Now, I wonder if such a
dictionary has already been implemented?

> Also, there are many words which mean one thing in one language (or
> locale) and something else in a different language, although they are

This won't really be a problem because the strings I'm dealing with are
geographical elements: place names etc. I've already dealt with localised
versions of these (e.g. Koln and Cologne are equivalent); it's essentially
just a problem with accents.

Many thanks,

Andrew
Rufus V. Smith
#5: Jul 22 '05

re: Unicode strings



"Andrew L" <andrew_on_tour@operamail.com> wrote in message
news:cb97lt$3qr$1$8302bc10@news.demon.co.uk...
> Hello all,
>
> What strategy should I use in solving the following problem? I have a list
> of unicode strings which I would like to compare with its English language
> 'equivalent.' eg
>
> "reykjavík" (note the accent above the i) should match both "reykjavík" and
> "reykjavik" (being the English equivalent).
>
> Similarly the German language letter 'ß' should match "ss", umlauted a's,
> o's etc should match a,o etc.
>
> How would I go about doing this using the c++ stdlib?
>
> Many thanks,

Well, were it my job, I'd just put all words into a big matching dictionary,
where each entry has a pointer (or index) to the first occurring synonym. The
first synonym can point to null or to itself.

Or if you want to be fancy, all the synonyms can point to the previous
synonym, and the first points to the last. This would allow you to print all
synonyms of a given word.
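
A rough sketch of the first variant (every entry points to the first synonym, which points to itself), assuming names are stored as plain std::string. SynonymTable and its member names are hypothetical, chosen only for this example:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical synonym table: every name maps to the first name of its
// group, and that first name maps to itself.
class SynonymTable {
public:
    // Registers a group of equivalent names; names[0] becomes the
    // representative that all members of the group point to.
    void add_group(const std::vector<std::string>& names)
    {
        assert(!names.empty());
        for (const std::string& name : names)
            first_[name] = names.front();
    }

    // Returns the representative of a name, or the name itself if unknown.
    std::string first_synonym(const std::string& name) const
    {
        auto it = first_.find(name);
        return it == first_.end() ? name : it->second;
    }

    // Two names are equivalent if they share a representative.
    bool equivalent(const std::string& a, const std::string& b) const
    {
        return first_synonym(a) == first_synonym(b);
    }

private:
    std::unordered_map<std::string, std::string> first_;
};
```

After add_group({"Cologne", "Koln"}), equivalent("Koln", "Cologne") returns true, while an unregistered name is only equivalent to itself. A name registered in two groups simply ends up pointing to the later group's first entry, so this layout assumes each name belongs to exactly one group.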

You do have to be careful, though. For city names you may not have a
problem, but for other kinds of synonyms, just because A is synonymous
with B, and C is synonymous with B, does not imply A is synonymous with C
(B may change meaning depending on the comparison).

Rufus


Ips
#6: Jul 22 '05

re: Unicode strings



"Karl Heinz Buchegger" <kbuchegg@gascad.at> wrote in message
news:40D8267C.27BF17AF@gascad.at...
> Bob Hairgrove wrote:
> [snip]
> > Even if you get rid of the special characters, you must really watch
> > out (e.g. German "Präservative" and English "preservative" <g>).
>
> One of the most puzzling words for German-speaking students of English
> is the word: eventually. There is a very similar word in German: eventuell,
> which means: maybe. But eventually means: finally
It's the same in Polish: 'ewentualnie' means 'maybe'.
Another puzzling example: 'aktualnie' (which looks like English 'actually')
means 'currently'; it is very often misused.
Sorry for the off-topic.

regards,
Ips


Bob Hairgrove
#7: Jul 22 '05

re: Unicode strings


On Tue, 22 Jun 2004 13:49:10 +0100, Andrew L
<andrew_on_tour@operamail.com> wrote:
>This won't really be a problem because the strings I'm dealing with are
>geographical elements - placenames etc. I've dealt with localised versions
>of these (eg Koln and Cologne are equivalent) it's essentially just a
>problem with accents.

And watch out for Paris, France vs. Paris, Texas; Moscow, Idaho vs.
Moscow, Russia; ad infinitum...


--
Bob Hairgrove
NoSpamPlease@Home.com
JKop
#8: Jul 22 '05

re: Unicode strings


Andrew L posted:
> Hello all,
>
> What strategy should I use in solving the following problem? I have a
> list of unicode strings which I would like to compare with its English
> language 'equivalent.' eg

8-bit chars will suffice, as long as all your text fits in a single-byte
encoding such as ISO 8859-1 (Latin-1).

> "reykjavík" (note the accent above the i) should match both "reykjavík"
> and "reykjavik" (being the English equivalent).
>
> Similarly the German language letter 'ß' should match "ss", umlauted
> a's, o's etc should match a,o etc.
>
> How would I go about doing this using the c++ stdlib?

Here's a function that checks that every 'ß' in the German string
corresponds to 's','s' in the English one, and that all other characters
are identical. It assumes a single-byte encoding such as ISO 8859-1, where
'ß' is the byte 0xDF:

bool Compare(const char* pGerman, const char* pEnglish)
{
    const char kEszett = '\xDF';   // 'ß' in ISO 8859-1

    for ( ; *pGerman != '\0'; ++pGerman)
    {
        if (*pGerman == kEszett)
        {
            // 'ß' must correspond to "ss" in the English string.
            if (pEnglish[0] != 's' || pEnglish[1] != 's') return false;

            pEnglish += 2;
        }
        else
        {
            // Any other character must match exactly.
            if (*pGerman != *pEnglish) return false;

            ++pEnglish;
        }
    }

    // Both strings must end at the same point.
    return *pEnglish == '\0';
}


Or you could go through the strings character by character and handle each
special character (not just 'ß') as you meet it in a single pass; it'd be
faster than running one test per special character.


-JKop
Meikel Weber
#9: Jul 22 '05

re: Unicode strings


> What strategy should I use in solving the following problem? I have a list
> of unicode strings which I would like to compare with its English language
> 'equivalent.' eg
>
> "reykjavík" (note the accent above the i) should match both "reykjavík" and
> "reykjavik" (being the English equivalent).
>
> Similarly the German language letter 'ß' should match "ss", umlauted a's,
> o's etc should match a,o etc.
>
> How would I go about doing this using the c++ stdlib?

I don't think the C++ standard library will help you here. You need to do
Unicode normalization and comparison.
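
To make the idea concrete, here is a toy sketch in plain C++ of what normalization-based matching boils down to: decompose each accented letter into a base letter plus a combining mark (as normalization form D in TR15 does), then drop the combining marks. The five-entry table and the names decompose, is_combining_mark and strip_marks are assumptions for illustration only; a real implementation would take the decomposition data from a library such as ICU.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy canonical decomposition: a precomposed letter becomes its base letter
// followed by a combining mark. Only a few entries, for illustration.
std::u32string decompose(const std::u32string& text)
{
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00E1', U"a\u0301"},  // a + combining acute
        {U'\u00ED', U"i\u0301"},  // i + combining acute
        {U'\u00E4', U"a\u0308"},  // a + combining diaeresis
        {U'\u00F6', U"o\u0308"},  // o + combining diaeresis
        {U'\u00FC', U"u\u0308"},  // u + combining diaeresis
    };
    std::u32string out;
    for (char32_t c : text) {
        auto it = table.find(c);
        if (it != table.end()) out += it->second;
        else out += c;
    }
    return out;
}

// True for code points in the Combining Diacritical Marks block.
bool is_combining_mark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Decomposes the text, then removes the combining marks.
std::u32string strip_marks(const std::u32string& text)
{
    std::u32string out;
    for (char32_t c : decompose(text))
        if (!is_combining_mark(c)) out += c;
    return out;
}

// Usage sketch with the example from the original question.
void usage_example()
{
    assert(strip_marks(U"reykjav\u00EDk") == U"reykjavik");
}
```

Note that 'ß' has no decomposition, so this step alone leaves it untouched; mapping it to "ss" needs a separate rule (Unicode case folding has one).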

Here are a few hints:

http://oss.software.ibm.com/icu/
(open source)

http://www.roguewave.com/support/doc.../i18nug/5.html
(I think this one is commercial)

http://www.unicode.org/unicode/reports/tr15/
( the spec )

Of course there are many more libs out there, just google around.

Greetings from Bonn, Germany
Meikel Weber
http://www.meikel.com


CFG
#10: Jul 22 '05

re: Unicode strings


I'm afraid the C++ stdlib will be of little help.

Take a look at the ICU library. It has related functionality:
1. Transliteration
http://oss.software.ibm.com/icu/user...Transform.html
2. Language specific case mapping
http://oss.software.ibm.com/icu/user...#lang_specific
3. Unicode string normalization:
http://oss.software.ibm.com/icu/user...alization.html

But the task is more difficult than it might look at first. What you are
trying to do is locale- and language-dependent and may require some
linguistic knowledge or even consulting a dictionary: for instance,
handling inflected forms and compound words in a foreign language, or
matching spelling variations such as "organization" and "organisation".

Simple, general character manipulation rules will not be able to handle
it correctly.

You've been warned.



CFG
#11: Jul 22 '05

re: Unicode strings


See also:

http://www.basistech.com/products/index.html

http://www-306.ibm.com/software/glob...ctionality.jsp


