Unicode strings

Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew
Jul 22 '05 #1
On Tue, 22 Jun 2004 13:12:33 +0100, Andrew L
<an************@operamail.com> wrote:
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew


You have to implement some kind of lookup table or dictionary.
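
A minimal sketch of such a lookup table, assuming the strings are held as UTF-32 (std::u32string, C++11) so that each accented letter is a single element. The names fold and matches are made up for illustration, and the table is a tiny sample, not a complete mapping:

```cpp
#include <map>
#include <string>

// Hypothetical helper: replace each accented letter found in the table
// with its plain-ASCII spelling. Characters not in the table pass through
// unchanged. The table is an illustrative sample only.
std::u32string fold(const std::u32string& text)
{
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00E1', U"a"},   // a with acute
        {U'\u00E4', U"a"},   // a with umlaut
        {U'\u00ED', U"i"},   // i with acute
        {U'\u00F6', U"o"},   // o with umlaut
        {U'\u00FC', U"u"},   // u with umlaut
        {U'\u00DF', U"ss"},  // German sharp s expands to two letters
    };

    std::u32string out;
    for (char32_t c : text) {
        auto it = table.find(c);
        if (it != table.end())
            out += it->second;  // mapped: append the ASCII replacement
        else
            out += c;           // unmapped: keep the character as it is
    }
    return out;
}

// Hypothetical helper: two strings match if their folded forms are equal.
bool matches(const std::u32string& a, const std::u32string& b)
{
    return fold(a) == fold(b);
}
```

With this table, matches(U"reykjav\u00EDk", U"reykjavik") is true, and so is comparing the accented spelling with itself. Folding both sides means it does not matter which of the two strings carries the accents.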

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.

Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are
spelled exactly the same. "Band" in German might be a different word
than "band" in English, for example.

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).
--
Bob Hairgrove
No**********@Home.com
Jul 22 '05 #2
Bob Hairgrove wrote:
[snip]
Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).


One of the most puzzling words for German-speaking students of English
is the word: eventually. There is a very similar word in German: eventuell,
which means: maybe. But eventually means: finally.

--
Karl Heinz Buchegger
kb******@gascad.at
Jul 22 '05 #3
Bob Hairgrove wrote:
You have to implement some kind of lookup table or dictionary.

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.
This is what I suspected. Many thanks for that. Now, I wonder if such a
dictionary has already been implemented?
Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are


This won't really be a problem because the strings I'm dealing with are
geographical elements (place names etc.). I've already dealt with localised
versions of these (e.g. Koln and Cologne are equivalent), so it's essentially
just a problem with accents.

Many thanks,

Andrew
Jul 22 '05 #4

"Andrew L" <an************@operamail.com> wrote in message
news:cb*******************@news.demon.co.uk...
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and "reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,


Well, were it my job, I'd just put all words into a big matching dictionary,
where each entry has a pointer (or index) to the first occurring synonym.
The first synonym can point to null or to itself.

Or if you want to be fancy, all the synonyms can point to the previous
synonym, and the first points to the last. This would allow you to print all
synonyms of a given word.

You do have to be careful, though. For city names you may not have a problem,
but for other types of synonym use you have to be careful. Just because A is
synonymous with B, and C is synonymous with B, does not imply A is synonymous
with C. (B may change meanings depending on the comparison.)
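
A rough sketch of the first scheme, where every entry points to the first occurring synonym and that first synonym points to itself. The class name SynonymTable and its members are invented for illustration, and the sketch assumes synonymy is transitive, which is reasonable for place names but not in the general case described above:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical class: maps every known spelling to the first spelling
// registered for the same place (its "first synonym").
class SynonymTable {
public:
    // Register `word` as a synonym of `known`. If `known` already has a
    // first synonym, `word` is attached to that one.
    void add(const std::string& known, const std::string& word)
    {
        const std::string root = resolve(known);  // copy before inserting
        table_.emplace(root, root);  // the first synonym points to itself
        table_.emplace(word, root);  // no effect if `word` is already present
    }

    // Two words are treated as equivalent if they share a first synonym.
    bool same(const std::string& a, const std::string& b) const
    {
        return resolve(a) == resolve(b);
    }

private:
    // Words that were never registered resolve to themselves.
    const std::string& resolve(const std::string& word) const
    {
        auto it = table_.find(word);
        return it == table_.end() ? word : it->second;
    }

    std::unordered_map<std::string, std::string> table_;
};
```

After add("Cologne", "Koln") and add("Koln", "Koeln"), all three spellings resolve to "Cologne", so same("Koeln", "Cologne") is true, while an unregistered name such as "Paris" only matches itself.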

Rufus
Jul 22 '05 #5
Ips

"Karl Heinz Buchegger" <kb******@gascad.at> wrote in message
news:40***************@gascad.at...
Bob Hairgrove wrote:
[snip]

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).


One of the most puzzling words for German-speaking students of English
is the word: eventually. There is a very similar word in German: eventuell,
which means: maybe. But eventually means: finally.

The same in Polish: 'ewentualnie' means 'maybe'.
Another puzzling example: 'aktualnie' (like English 'actually') means
'currently', and it is very often misused.
Sorry for the off-topic.

regards,
Ips
Jul 22 '05 #6
On Tue, 22 Jun 2004 13:49:10 +0100, Andrew L
<an************@operamail.com> wrote:
This won't really be a problem because the strings I'm dealing with are
geographical elements - placenames etc. I've dealt with localised versions
of these (eg Koln and Cologne are equivalent) it's essentially just a
problem with accents.


And watch out for Paris, France vs. Paris, Texas; Moscow, Idaho vs.
Moscow, Russia; ad infinitum...
--
Bob Hairgrove
No**********@Home.com
Jul 22 '05 #7
Andrew L posted:
Hello all,

What strategy should I use in solving the following problem? I have a
list of unicode strings which I would like to compare with its English
language 'equivalent.' eg
8-Bit chars will suffice.
"reykjavík" (note the accent above the i) should match both "reykjavík"
and "reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted
a's, o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?


Here's a function that checks that the German string equals the English one
once every 'ß' in the German string is read as 's','s':

// Assumes 8-bit Latin-1 (ISO 8859-1) text, where 'ß' is the single
// byte 0xDF. It will not work on UTF-8, where 'ß' takes two bytes.
bool Compare(const char* pGerman, const char* pEnglish)
{
    for ( ; *pGerman != '\0'; ++pGerman)
    {
        if (*pGerman == '\xDF')
        {
            // 'ß' must line up with two consecutive 's' characters.
            if (pEnglish[0] != 's' || pEnglish[1] != 's') return false;
            pEnglish += 2;
        }
        else
        {
            // Every other character must match exactly.
            if (*pGerman != *pEnglish) return false;
            ++pEnglish;
        }
    }

    // Both strings must end at the same point.
    return *pEnglish == '\0';
}

It goes through the strings character by character, so you can add further
tests for the other special characters (umlauts, accents) in the same loop,
and it stays a single pass that way too.
-JKop
Jul 22 '05 #8
> What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and "reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?


I don't think the C++ standard library will help you here. You need to do
Unicode normalization and comparison.
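
To illustrate why normalization helps: in the decomposed form (NFD) described in the spec linked below, an accented letter becomes the base letter followed by a separate combining mark, so dropping the marks leaves the plain letter. The sketch below assumes the input has already been converted to NFD by some library (the standard library cannot do that step), and the function name is made up. It only covers the basic Combining Diacritical Marks block, U+0300 to U+036F:

```cpp
#include <string>

// Hypothetical helper. Assumption: `decomposed` is already in Unicode
// Normalization Form D, produced elsewhere (for example by ICU).
std::u32string strip_combining_marks(const std::u32string& decomposed)
{
    std::u32string out;
    for (char32_t c : decomposed) {
        // Keep everything outside the Combining Diacritical Marks
        // block (U+0300..U+036F); drop the marks themselves.
        if (c < 0x0300 || c > 0x036F)
            out += c;
    }
    return out;
}
```

For example, the NFD spelling U"reykjavi\u0301k" (i followed by a combining acute accent) comes out as U"reykjavik". Note that 'ß' has no decomposition, so the "ss" case still needs its own rule.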

Here are a few hints:

http://oss.software.ibm.com/icu/
(open source)

http://www.roguewave.com/support/doc.../i18nug/5.html
(I think this one is commercial)

http://www.unicode.org/unicode/reports/tr15/
( the spec )

Of course there are many more libs out there, just google around.

Greetings from Bonn, Germany
Meikel Weber
http://www.meikel.com
Jul 22 '05 #9
CFG
I'm afraid the C++ stdlib will be of little help.

Take a look at the ICU library. It has related functionality:
1. Transliteration
http://oss.software.ibm.com/icu/user...Transform.html
2. Language specific case mapping
http://oss.software.ibm.com/icu/user...#lang_specific
3. Unicode string normalization:
http://oss.software.ibm.com/icu/user...alization.html

But the task is more difficult than it might look at first. What you are
trying to do is locale and language dependent and may require some
linguistic knowledge or even consulting a dictionary: for instance, to
handle inflected forms, compound words in a foreign language, or spelling
variations such as "organization" and "organisation".

Simple, general character-manipulation rules will not be able to
handle it right.

You've been warned.

Jul 22 '05 #10
