
Welsh language - ISO-8859-1 or Unicode ?

P: n/a
Hello -

I'm working on a team that is planning to add Welsh language support to a
large existing IT system which is partially web-based and
English-language-only so far. I've heard that 2 characters in Welsh
(w-circumflex and y-circumflex) are not supported in our default ISO-8859-1
character set, so a partial move to Unicode for internal storage of text
might be required.

I haven't yet found a Welsh-language website that uses these 2 characters,
so are they actually used much in Welsh? Is not supporting them likely to
cause problems?

Thanks
Jun 27 '08 #1
25 Replies


P: n/a
I've just found a webpage that uses y-circumflex at the end of the third
paragraph, so it can't be that uncommon:
http://news.bbc.co.uk/welsh/hi/newsi...00/7462534.stm

This webpage uses ISO-8859-1 with entities for the y-circumflex. Using
entities would be very messy in my application, so if support for these
characters is needed, I would have to go for Unicode.
I guess my question still is: would not supporting these 2 characters be
considered bad practice for a Welsh-language business application?
Jun 27 '08 #2

P: n/a
Scripsit Simon:
> I'm working on a team that is planning to add Welsh language support
> to a large existing IT system which is partially web-based and
> English-language-only so far.

Do you plan to add other languages later? Is this about names only or
also about prose texts? After all, ISO-8859-1 is insufficient even for
normal English prose; think about dashes and proper quotation marks.

> I've heard that 2 characters in Welsh
> (w-circumflex and y-circumflex) are not supported in our default
> ISO-8859-1 character set,

Right. They are included in ISO-8859-14 (a.k.a. ISO Latin 8, or
"Celtic"), but that's not a feasible option on the WWW (IE does not
recognize that encoding).

> so a partial move to Unicode for internal
> storage of text might be required.

That might be easy, or it might be extremely complicated. But that's
really beyond the scope of these groups. As far as WWW authoring is
concerned, Unicode - specifically UTF-8 - is a good option, but you
could keep using ISO-8859-1 and represent those letters using character
references like &#373; for w with circumflex. But you might have to deal
with the encoding problem of the databases involved, for example, and
with data entry.

> I haven't yet found a Welsh-language website that uses these 2
> characters, so are they actually used much in Welsh?

I don't know Welsh, but I expect those characters to be so rare that
using some clumsy notation like character references for them wouldn't
be a major problem.

> Is not supporting them likely to cause problems?

Some people might say that it is tolerable to omit the circumflex, but
it may be distinctive (i.e. the only difference between otherwise
identical words, though the context usually resolves the issue). And in
2008, I think it is inappropriate to add support for languages to IT
systems without supporting them properly, with all the characters needed
for their correct writing.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Jun 27 '08 #3
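
The encoding claims in the reply above can be checked with a minimal Python sketch. It assumes Python 3 and its standard codecs; the helper name `can_encode` is made up for this illustration and does not come from the thread.

```python
# The four Welsh circumflex letters, written as escapes so the source file's
# own encoding cannot matter: U+0174 Ŵ, U+0175 ŵ, U+0176 Ŷ, U+0177 ŷ.
WELSH_LETTERS = "\u0174\u0175\u0176\u0177"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 lacks the letters; ISO-8859-14 and UTF-8 can represent them.
for name in ("iso-8859-1", "iso-8859-14", "utf-8"):
    print(name, can_encode(WELSH_LETTERS, name))
```

Whether a given browser, database or application accepts ISO-8859-14 is a separate question that this check does not answer.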

P: n/a
Thanks for your reply.

Unfortunately multi-lingual support has not really been a priority in the
system design up to now, although it has always been a possible future
requirement. The system is a complex mixture of databases, Windows
applications and web applications. I believe all the databases and
programming languages we use already support Unicode, so I would aim to
use that support, rather than character references which would be clumsy
as you say.
Jun 27 '08 #4

P: n/a
Scripsit Simon:
> I believe all the databases and programming languages we use already
> support Unicode, so I would aim to use that support, rather than
> character references which would be clumsy as you say.

Sounds like a simple way to go then. It is surely simplest to use
Unicode throughout, especially if character data needs to be transferred
between applications as plain text (where no character references or
markup can be used). It's also simplest in data entry if people
immediately see what they have typed, and entering characters with
circumflex should not be a problem; you can e.g. use the keyboard layout
outlined at
http://en.wikipedia.org/wiki/Keyboar...ngdom_extended

Yet, it's always possible that some software component doesn't grok
Unicode. Let's hope such problems are solvable. The web-related
components shouldn't be a problem.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Jun 27 '08 #5

P: n/a
Simon wrote:
> Hello -
>
> I'm working on a team that is planning to add Welsh language support to a
> large existing IT system which is partially web-based and
> English-language-only so far. I've heard that 2 characters in Welsh
> (w-circumflex and y-circumflex) are not supported in our default ISO-8859-1
> character set, so a partial move to Unicode for internal storage of text
> might be required.
>
> I haven't yet found a Welsh-language website that uses these 2 characters,
> so are they actually used much in Welsh? Is not supporting them likely to
> cause problems?

It could be a support problem (though I don't know why, given the
availability of UTF-8 as well as the option of numeric character
references): see the note at the bottom of

http://www.menai.ac.uk/clicclic/

As made clear at

http://www.cs.cf.ac.uk/fun/welsh/Lesson01.html

the circumflex really is supposed to appear in these locations. (Note
that even on this page, section 1.2 explains that because of support
issues, they are using their own ugly work-around for accented
characters.) Examples are given: "ty^" = "house", along with the pair
"gw^ydd" = "goose" and "gwy^dd" = "trees", which are pronounced differently.
Jun 27 '08 #6
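
As a rough illustration of why the circumflex matters, the following Python sketch converts the caret work-around quoted above ("gw^ydd", "gwy^dd") into real accented letters and then shows that dropping the accents makes the two words collide. The function names are hypothetical, and the approach (a combining circumflex followed by NFC normalisation) is one possible technique, not something the linked lesson page prescribes.

```python
import re
import unicodedata

COMBINING_CIRCUMFLEX = "\u0302"


def caret_to_unicode(text: str) -> str:
    """Turn the ASCII work-around 'w^' / 'y^' into real accented letters."""
    # Replace the caret with a combining circumflex after the letter, then
    # compose (NFC) so that 'w' + U+0302 becomes the single character U+0175.
    combined = re.sub(r"([A-Za-z])\^", r"\1" + COMBINING_CIRCUMFLEX, text)
    return unicodedata.normalize("NFC", combined)


def strip_accents(text: str) -> str:
    """Drop all combining marks, i.e. what omitting the circumflex amounts to."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


goose = caret_to_unicode("gw^ydd")   # circumflex on the w
trees = caret_to_unicode("gwy^dd")   # circumflex on the y
print(goose, trees, goose == trees)

# Without the circumflex the two different words become the same string.
print(strip_accents(goose) == strip_accents(trees))
```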

P: n/a
Message-ID: <48***********************@news.gradwell.net> from Simon
contained the following:

> Unfortunately (for me) that webpage uses character entities to represent the
> characters outside ISO-8859-1. This isn't really a workable approach for me,
> because the text I'm displaying will be stored and processed in various
> databases and applications (web and non-web). I will probably end up storing
> and processing the data using UCS-2 or similar and generating webpages in
> UTF-8.

Surely you can add the character entities using a script when the pages
are generated?
--
Geoff Berrow 0110001001101100010000000110
001101101011011001000110111101100111001011
100110001101101111001011100111010101101011
Jun 27 '08 #7
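
The suggestion above (keep the data as real characters and add the references only when a page is generated) might look roughly like this in Python. This is a sketch under assumptions: `render_latin1_page` and the sample string are invented for the example, and nothing in the thread says what language or templating the poster's system actually uses.

```python
import html


def render_latin1_page(body_text: str) -> bytes:
    """Serialise Unicode text as ISO-8859-1 bytes suitable for an HTML page."""
    # Escape markup-significant characters first, then let the codec turn
    # anything outside ISO-8859-1 into a numeric character reference.
    escaped = html.escape(body_text, quote=False)
    return escaped.encode("iso-8859-1", errors="xmlcharrefreplace")


# Stored as Unicode everywhere else: "Dŵr Cymru & tŷ".
stored = "D\u0175r Cymru & t\u0177"
page_bytes = render_latin1_page(stored)
print(page_bytes)

# An HTML parser (here html.unescape) recovers the original text.
round_trip = html.unescape(page_bytes.decode("iso-8859-1"))
print(round_trip == stored)
```

The references exist only in the generated page, so databases and non-web applications never see them.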

P: n/a
On Tue, 24 Jun 2008 19:15:48 +0100, Simon wrote:

> I will probably end up storing and processing the data using UCS-2 or
> similar and generating webpages in UTF-8.

I'll add my vote for UTF-8 as the way to go if there's a choice. Either
way there'll be some problems but UTF-8 is likely to be more future-proof
in the long run.

Oh, and I have encountered the need for a w with circumflex, but that was
an old song title so it might have been an archaic Welsh form.

--
Anahata
an*****@treewind.co.uk ==//== 01638 720444
http://www.treewind.co.uk ==//== http://www.myspace.com/maryanahata

Jun 27 '08 #8
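
For a feel of the storage trade-off between the two options mentioned above, here is a small Python comparison. It is only a sketch: Python has no codec literally named UCS-2, so `utf-16-le` stands in for it (the two coincide for characters in the Basic Multilingual Plane, which includes ŵ and ŷ).

```python
# "Dŵr Cymru", with the ŵ written as an escape (U+0175).
text = "D\u0175r Cymru"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # stand-in for UCS-2, 2 bytes per character

# 9 characters: 10 bytes in UTF-8 (only the ŵ needs 2), 18 bytes in UTF-16.
print(len(text), len(utf8_bytes), len(utf16_bytes))

# Both encodings round-trip the text without loss.
print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text)
```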

P: n/a
Simon wrote:
> I've just found a webpage that uses y-circumflex at the end
> of the third paragraph, so it can't be that uncommon:
> http://news.bbc.co.uk/welsh/hi/newsi...00/7462534.stm

That’s really odd! They have a “y with circumflex” (ŷ)
but write “vowel with ASCII apostrophe” instead of
“vowel with acute” throughout.

Btw:
You should start with your newsreader^W Outlook Express
in supporting special, non-ASCII characters:

Tools → Options → Send
Mail Sending Format → Plain Text Settings → Message format MIME
News Sending Format → Plain Text Settings → Message format MIME
Encode text using: None

Otherwise you cannot even write or quote

1 € = 100 ¢
Jun 27 '08 #9

P: n/a
Simon wrote:
> I've heard that 2 characters in Welsh (w-circumflex and y-circumflex)
> are not supported in our default ISO-8859-1 character set, so a
> partial move to Unicode for internal storage of text might be required.
>
> I haven't yet found a Welsh-language website that uses these 2 characters,
> so are they actually used much in Welsh? Is not supporting them likely to
> cause problems?

ISO-8859-1 does not even contain a euro sign (€), which seems to be
an even stronger argument to move to Unicode asap than the missing
Ŵ ŵ Ŷ ŷ for Welsh.
Jun 27 '08 #10

P: n/a

Holy crap. I'm looking at two of your posts, and both in the body and in
the article's line in the headers pane, your name is not in the font I
have configured. And it's a *different* not-configured-by-me font in the
body than in the headers pane.
--
Blinky
Killing all posts from Google Groups
The Usenet Improvement Project -- http://improve-usenet.org
Found: a free GG-blocking news *feed* -- http://usenet4all.se

Jun 27 '08 #11

P: n/a
Blinky the Shark wrote:
> Holy crap. I'm looking at two of your posts, and both in the body and in
> the article's line in the headers pane, your name is not in the font I
> have configured. And it's a *different* not-configured-by-me font in the
> body than in the headers pane.

I noticed the same thing, in Thunderbird.
Jun 27 '08 #12

P: n/a
Harlan Messinger wrote:
> Blinky the Shark wrote:
>> Holy crap. I'm looking at two of your posts, and both in the body and in
>> the article's line in the headers pane, your name is not in the font I
>> have configured. And it's a *different* not-configured-by-me font in the
>> body than in the headers pane.
>
> I noticed the same thing, in Thunderbird.

His FROM line reads

From: =?UTF-8?B?77yh772O772E772S772F772B772T44CA77yw772S772J772M772P772Q?=

I don't know what to make of this.
Jun 27 '08 #13
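
The From line quoted above is an RFC 2047 "encoded word": `UTF-8` names the charset, `B` means Base64, and the rest is the encoded display name. A minimal Python sketch for decoding it with the standard `email.header` module follows; it assumes the Base64 payload is the quoted one with the stray space removed (the space looks like a scraping artifact).

```python
from email.header import decode_header, make_header

# The encoded word from the post, split only to keep the line short.
raw_from = (
    "=?UTF-8?B?77yh772O772E772S772F772B772T44CA"
    "77yw772S772J772M772P772Q?="
)

# decode_header yields (bytes, charset) pairs; make_header joins them
# back into a header object whose str() is the decoded text.
display_name = str(make_header(decode_header(raw_from)))

print(display_name)
print([f"U+{ord(ch):04X}" for ch in display_name])
```

The code points all fall in the fullwidth forms block (plus U+3000), which is why the name looks like Latin text but is drawn from a different font.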

P: n/a
Harlan Messinger <hm*******************@comcast.net> writes:
> Harlan Messinger wrote:
>> Blinky the Shark wrote:
>>> Holy crap. I'm looking at two of your posts, and both in the body and in
>>> the article's line in the headers pane, your name is not in the font I
>>> have configured. And it's a *different* not-configured-by-me font in the
>>> body than in the headers pane.
>>
>> I noticed the same thing, in Thunderbird.
>
> His FROM line reads
>
> From: =?UTF-8?B?77yh772O772E772S772F772B772T44CA77yw772S772J772M772P772Q?=
>
> I don't know what to make of this.

If I cut and paste to my utf-8-dump program:

$ utf-8-dump -f '[%u] %n\n'
Andreas 
[U+FF21] FULLWIDTH LATIN CAPITAL LETTER A
[U+FF4E] FULLWIDTH LATIN SMALL LETTER N
[U+FF44] FULLWIDTH LATIN SMALL LETTER D
[U+FF52] FULLWIDTH LATIN SMALL LETTER R
[U+FF45] FULLWIDTH LATIN SMALL LETTER E
[U+FF41] FULLWIDTH LATIN SMALL LETTER A
[U+FF53] FULLWIDTH LATIN SMALL LETTER S
[U+3000] IDEOGRAPHIC SPACE
[U+000A] <control>

Presumably your newsreader thinks it needs a separate font to find
suitable glyphs for these characters (mine does too).

--
Ben.
Jun 27 '08 #14
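
The `utf-8-dump` program used above appears to be the poster's own tool, so its options are not reproduced here. A comparable listing can be sketched with Python's standard `unicodedata` module; the function name `dump` and the `<control>` placeholder for unnamed characters are choices made for this example.

```python
import unicodedata


def dump(text: str) -> list[str]:
    """Return one '[U+XXXX] NAME' line per character of `text`."""
    lines = []
    for ch in text:
        # Control characters have no Unicode name, so supply a placeholder.
        name = unicodedata.name(ch, "<control>")
        lines.append(f"[U+{ord(ch):04X}] {name}")
    return lines


# Fullwidth "Andreas", an ideographic space and a newline, as in the post.
sample = "\uff21\uff4e\uff44\uff52\uff45\uff41\uff53\u3000\n"
for line in dump(sample):
    print(line)
```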

P: n/a
Scripsit Andreas Prilop:
> ISO-8859-1 does not even contain a euro sign (€), which seems to be
> an even stronger argument to move to Unicode asap than the missing
> Ŵ ŵ Ŷ ŷ for Welsh.

Not really, because
a) the UK does not use the euro currency
b) the euro sign can conveniently be written using the entity reference
&euro;
c) the euro sign should not be used in normal text, according to
reputable language authorities; instead, the currency name should be
written, except perhaps in tables and other contexts where saving space
is crucial.

For commercial pages oriented towards countries using the euro, the euro
sign is needed, but it’s not really comparable to the issue of letters
needed for proper writing of a language.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Jun 27 '08 #15

P: n/a
On Wed, 25 Jun 2008, Jukka K. Korpela wrote:
>> ISO-8859-1 does not even contain a euro sign (€), which seems
>> to be an even stronger argument to move to Unicode asap
>
> Not really, because
> a) the UK does not use the euro currency

By that logic, they won't need a dollar sign on their keyboards.

> b) the euro sign can conveniently be written using the entity
> reference &euro;

In HTML. But IIRC, the OP wrote of some "large existing IT system"
with internal ISO-8859-1 character set. I wonder if one could
write &euro; there.
Jun 27 '08 #16
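
The distinction drawn above (the entity works in HTML, but a non-HTML system just sees six ASCII characters) can be illustrated with a short Python sketch. The helper `fits_latin1` and the sample string are invented for the example.

```python
import html

EURO = "\u20ac"  # the euro sign itself


def fits_latin1(text: str) -> bool:
    """Return True if `text` can be stored unchanged in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


entity_text = "1 &euro; = 100 cent"

print(fits_latin1(EURO))         # the sign itself does not fit
print(fits_latin1(entity_text))  # the entity is plain ASCII, so it does

# Only an HTML-aware step turns the entity into the actual character;
# any other consumer of the stored text sees the literal "&euro;".
decoded = html.unescape(entity_text)
print(decoded)
```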

P: n/a
Message-ID: <87************@bsb.me.uk> from Ben Bacarisse contained the
following:

> Presumably your newsreader thinks it needs a separate font to find
> suitable glyphs for these characters (mine does too).

Yup, just get a load of question marks in Agent.

--
Geoff Berrow 0110001001101100010000000110
001101101011011001000110111101100111001011
100110001101101111001011100111010101101011
Jun 27 '08 #17

P: n/a
In comp.infosystems.www.authoring.html message <kv6dnRvtYfLOa_zVnZ2dnUVZ
8h******@posted.plusnet>, Wed, 25 Jun 2008 02:40:03, anahata
<an*****@treewind.co.uk> posted:

> Oh, and I have encountered the need for a w with circumflex, but that was
> an old song title so it might have been an archaic Welsh form.

A search for "Welsh Water" rapidly locates <http://www.welshwater.com/>,
in the foot of which is "Dwr Cymru Cyf 2008." Copy'n'paste into here
has not reproduced the circumflex over the w; but graphic copy'n'paste
into Paint, zoomed, reveals it well.

See also <http://www.dwrcymru.co.uk/Welsh/Contactus/index.asp>, or
<http://cy.wikipedia.org/wiki/Hafan>.

ISTM unlikely that the Welsh could manage without their word for water.

--
(c) John Stockton, nr London UK. replyYYWW merlyn demon co uk Turnpike 6.05.
Web <URL:http://www.uwasa.fi/~ts/http/tsfaq.html> - Timo Salmi: Usenet Q&A.
Web <URL:http://www.merlyn.demon.co.uk/news-use.htm> : about usage of News.
No Encoding. Quotes precede replies. Snip well. Write clearly. Mail no News.
Jun 27 '08 #18

P: n/a
Andreas Prilop wrote:

<snip>

For the record, I see no unusual font behavior in the From field from that
post.
--
Blinky
Is your ISP dropping Usenet?
Need a new feed?
http://blinkynet.net/comp/newfeed.html

Jun 27 '08 #19

P: n/a
On Wed, 25 Jun 2008 21:48:14 +0200, Dr J R Stockton
<jr*@merlyn.demon.co.uk> wrote:
> In comp.infosystems.www.authoring.html message <kv6dnRvtYfLOa_zVnZ2dnUVZ
> 8h******@posted.plusnet>, Wed, 25 Jun 2008 02:40:03, anahata
> <an*****@treewind.co.uk> posted:
>> Oh, and I have encountered the need for a w with circumflex, but that
>> was an old song title so it might have been an archaic Welsh form.
>
> A search for "Welsh Water" rapidly locates <http://www.welshwater.com/>,
> in the foot of which is "Dwr Cymru Cyf 2008." Copy'n'paste into here
> has not reproduced the circumflex over the w;

Hmmm. I would find that hard to believe (probably just a newsreader
thing), just a test (UTF-8 posting):
Dŵr Cymru Welsh Water
--
Rik Wasmus
Jun 27 '08 #20

P: n/a
On Thu, 26 Jun 2008 08:08:36 +0200, Rik Wasmus
<lu************@hotmail.com> wrote:
> just a test (UTF-8 posting):
> Dŵr Cymru Welsh Water

Well, at least here it's clearly there :)
--
Rik Wasmus
....spamrun finished
Jun 27 '08 #21

P: n/a
On 25-06-08 18:01, Ben Bacarisse wrote:
> Harlan Messinger <hm*******************@comcast.net> writes:
>> Harlan Messinger wrote:
>>> Blinky the Shark wrote:
>>>> Holy crap. I'm looking at two of your posts, and both in the body and in
>>>> the article's line in the headers pane, your name is not in the font I
>>>> have configured. And it's a *different* not-configured-by-me font in the
>>>> body than in the headers pane.
>>> I noticed the same thing, in Thunderbird.
>> His FROM line reads
>>
>> From: =?UTF-8?B?77yh772O772E772S772F772B772T44CA77yw772S772J772M772P772Q?=
>>
>> I don't know what to make of this.
>
> If I cut and paste to my utf-8-dump program:
>
> $ utf-8-dump -f '[%u] %n\n'
> Andreas 
> [U+FF21] FULLWIDTH LATIN CAPITAL LETTER A
> [U+FF4E] FULLWIDTH LATIN SMALL LETTER N
> [U+FF44] FULLWIDTH LATIN SMALL LETTER D
> [U+FF52] FULLWIDTH LATIN SMALL LETTER R
> [U+FF45] FULLWIDTH LATIN SMALL LETTER E
> [U+FF41] FULLWIDTH LATIN SMALL LETTER A
> [U+FF53] FULLWIDTH LATIN SMALL LETTER S
> [U+3000] IDEOGRAPHIC SPACE
> [U+000A] <control>

Those are actually fake Latin letters, which are used in Japanese and
Chinese systems: since the CJK symbols are broader, they have broader
Latin letters as well. It's a mean trick to make your name look
different without using HTML.

> Presumably your newsreader thinks it needs a separate font to find
> suitable glyphs for these characters (mine does too).

That's very probable, since most fonts won't contain those glyphs.
You'd need a Chinese/Japanese font which contains them.

H.
Jun 27 '08 #22
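
If software needs to treat such fullwidth letters as their ordinary Latin counterparts (for searching or display in a plain font, say), Unicode compatibility normalisation is one way to do it. The Python sketch below is an illustration only; the variable names are invented, and whether folding is appropriate depends on the application.

```python
import unicodedata

# The fullwidth display name discussed above, written as escapes.
fullwidth_name = (
    "\uff21\uff4e\uff44\uff52\uff45\uff41\uff53\u3000"
    "\uff30\uff52\uff49\uff4c\uff4f\uff50"
)

# Each character is classified as East Asian width "F" (fullwidth).
widths = {unicodedata.east_asian_width(ch) for ch in fullwidth_name}
print(widths)

# NFKC replaces compatibility characters with their plain equivalents,
# including U+3000 IDEOGRAPHIC SPACE -> U+0020 SPACE.
folded = unicodedata.normalize("NFKC", fullwidth_name)
print(folded)
```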

P: n/a
In uk.net.web.authoring message <op***************@metallium.lan>, Thu,
26 Jun 2008 08:08:36, Rik Wasmus <lu************@hotmail.com> posted:
>> A search for "Welsh Water" rapidly locates <http://www.welshwater.com/>,
>> in the foot of which is "Dwr Cymru Cyf 2008." Copy'n'paste into here
>> has not reproduced the circumflex over the w;
>
> Hmmm. I would find that hard to believe (probably just a newsreader
> thing),

Then you should read my headers.

--
(c) John Stockton, nr London UK. ??*@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Check boilerplate spelling -- error is a public sign of incompetence.
Never fully trust an article from a poster who gives no full real name.
Jun 27 '08 #23

P: n/a
Andreas Prilop wrote:
> On Wed, 25 Jun 2008, Jukka K. Korpela wrote:
>> a) the UK does not use the euro currency
>
> By that logic, they won't need a dollar sign on their keyboards.

The dollar key is kinda handy for some programming languages (Perl, PHP,
XSLT, others too).

The missing '#' on UK Mac keyboards is very annoying. Although Alt+3 does
make it appear, it's easier to just remap the '§' key. (And why they felt
the need to have '§' when they don't have a '#' or '€' remains a mystery!)

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.24.4-1mnbcustom-g5n1, up 43 days, 19:44.]

Olympics Monkey
http://tobyinkster.co.uk/blog/2008/0...lympic-monkey/
Aug 4 '08 #24

P: n/a
JC
Dwr Cymru Cyf
Dwr Cymru Welsh Water
The Windows Courier font does not seem to support UTF-8. Outlook
Express changes the font from Courier to a variety of Arial when the
Encoding settings get changed to Unicode (UTF-8).

--
Jim Carlock
More Than Five Senses... We All Know More.
http://www.associatedcontent.com/art...ve_senses.html
Aug 25 '08 #25

P: n/a
On Mon, 25 Aug 2008, JC wrote:
> The Windows Courier font does not seem to support UTF-8.

The *bitmap font* Courier, which comes with MS Windows and
which has nothing to do with other PostScript/TrueType fonts
of the same name, has a *very* restricted character repertoire.
You should use Courier New (TrueType) instead.

--
http://groups.google.com/groups/sear...Alan.J.Flavell
Aug 26 '08 #26
