UTF-8 & Unicode

Do web pages have to be created in unicode in order to use UTF-8 encoding?
If so, can anyone name a free application which I can use under Windows 98
to create web pages?
Jul 20 '05
In article <Pt************ ***@newsfe1-win.ntli.net>,
EU citizen <no*******@form e.com> wrote:
Through experimentation with the W3C HTML validator, I've worked
out that iso-8859-1 will work for Notepad files with standard English text
plus acute accented vowels.
Beware that Microsoft uses some proprietary encodings that are ISO-8859-1
for characters A0-FF, but use the C1 controls (80-9F) for other purposes.
If you don't use any of those (and the Euro symbol is quite likely one
of them) you should be OK.
The need for the XML encoding statement to match the original file format
was not mentioned in any of the (many) articles I've read on XML/XHTML over
the last *four* years.


In most circumstances UTF-8 is the default encoding for XML if there
is no encoding declaration. In theory for text/* served by HTTP,
8859-1 is (or was - they may have changed it) the default. But if you
stick to ascii, it won't matter. And remember that you *can* stick to
ASCII and use character references (such as &#xa3;) or entity
references (if you declare them in your DTD) for all non-ascii
characters.
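
As a rough illustration of the "stick to ASCII and use character references" approach, here is a minimal Python sketch. The function name `to_ascii_with_char_refs` is made up for this example; the `xmlcharrefreplace` error handler is part of Python's standard codecs. Python emits decimal references (`&#163;`), which mean the same thing as the hexadecimal form `&#xa3;`.

```python
def to_ascii_with_char_refs(text: str) -> str:
    """Return text as pure ASCII, with non-ASCII characters replaced
    by numeric character references (hypothetical helper)."""
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "Price: \u00a35, caf\u00e9"   # "Price: £5, café"
print(to_ascii_with_char_refs(sample))
```

The result contains only bytes 0-127, so it reads the same whether it is labelled US-ASCII, ISO-8859-1, Windows-1252 or UTF-8.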

-- Richard
Jul 20 '05 #11
On Wed, 2 Feb 2005, EU citizen wrote:
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437

XML documents can contain foreign characters like Norwegian ???, or French
???.


You need to set up your newsreader^W Outlook Express correctly
in order to transmit special, non-ASCII characters:

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #12
On Wed, 2 Feb 2005, Richard Tobin wrote:
In most circumstances UTF-8 is the default encoding for XML if there
is no encoding declaration.
It's a bit more complicated than that.

http://www.w3.org/TR/2000/REC-xml-20001006#charencoding

The /default/ is to look for a BOM - failing which, utf-8 is assumed.
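
A simplified Python sketch of that default, for illustration only: it checks for a UTF-8 or UTF-16 byte order mark and otherwise assumes UTF-8. The name `sniff_bom` is hypothetical, and the sketch deliberately ignores UTF-32 and any encoding declaration, both of which a real XML processor also has to consider.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Guess the encoding family from a leading BOM (simplified sketch)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to the UTF-8 assumption.
    return "utf-8"


print(sniff_bom("<a/>".encode("utf-16")))   # Python's utf-16 codec writes a BOM
print(sniff_bom(b"<a/>"))
```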

On the other hand, it seems you've caught me out with the next bit:
In theory for text/* served by HTTP,
8859-1 is (or was - they may have changed it) the default.
HTTP hasn't changed. RFC2616 section 3.7.1, last paragraph. Thanks!

So I suppose /that/ was the explanation for the W3C validator not
failing the cited page from w3schools. Thanks.
But if you stick to ascii, it won't matter.


True - although that's hardly a very efficient way to write, say,
Cyrillic, or Arabic, or Japanese.
Jul 20 '05 #13
EU citizen wrote:
My original question asked for suggestions about suitable applications, and
yet no one has named one.


If you cared to take the time to read the guide to unicode I linked to
earlier, you would have found editors mentioned in part 2. Within it, I
mentioned two windows editors that support Unicode: SuperEdi [1] and
Macromedia Dreamweaver. A simple search for "Unicode Editor" also
reveals many other editors that may be capable of doing the job.

[1] http://www.wolosoft.com/en/superedi/

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://SpreadFirefox.com/ Igniting the Web
Jul 20 '05 #14
EU citizen wrote:
"Richard Tobin" <ri*****@cogsci .ed.ac.uk> wrote in message
news:ct******** **@pc-news.cogsci.ed. ac.uk...
<?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>
Based on what I know now, I agree. I always assumed that Notepad, being a
simple text editor, saved files in Ascii format.


By default, Notepad saves files as Windows-1252. The characters from 0
to 127 (0x7F) are identical to US-ASCII, ISO-8859-1, UTF-8 and many
other character sets that make use of the same subset. Thus, any file
saved using Windows-1252 that only makes use of those characters is
compatible with all those other encodings.

The characters from 160 (0xA0) to 255 (0xFF) match those contained in
ISO-8859-1. Thus, any file saved using Windows-1252 that only makes use
of the aforementioned US-ASCII subset and that range of characters is
compatible with ISO-8859-1.

The characters from 128 (0x80) to 159 (0x9F), however, do not match
those in any other encoding, making any Windows-1252 file using these
characters incompatible with any other encoding. For XML, this must be
declared appropriately in the XML declaration. The characters in this
range contain the infamous "smart quotes" (Left and Right, single and
double quotation marks: ‘ ’ “ ”) that cause so many problems for the
uneducated. Use of this range while declaring ISO-8859-1, UTF-8 or any
other encoding, will cause errors because they are control characters in
the character repertoires used by those encodings.
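
To make the three ranges concrete, here is a small Python sketch that decodes the same four bytes under different labels. The byte values are chosen for illustration: 0x93 and 0x94 are the Windows-1252 double "smart quotes", 0x41 is ASCII "A", and 0xE9 is "é" in both Windows-1252 and ISO-8859-1.

```python
raw = bytes([0x93, 0x41, 0xE9, 0x94])

as_cp1252 = raw.decode("cp1252")       # what Notepad meant
as_latin1 = raw.decode("iso-8859-1")   # what a mislabelled file yields

print(repr(as_cp1252))   # curly quotes around "Aé"
print(repr(as_latin1))   # C1 control characters U+0093 and U+0094 instead

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # As UTF-8 the bytes are not even a valid sequence.
    print("not valid UTF-8:", exc.reason)
```

Only the two bytes in the 0x80-0x9F range come out differently under ISO-8859-1; the ASCII byte and the 0xE9 byte are unaffected.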
Nothing in Notepad's Help, Windows' Help or Microsoft's website says anything
about the format used by Notepad.
It is actually mentioned in a few places on the web, though it's not
easy to find. Microsoft tend to incorrectly refer to it as ANSI, even
though it is not.

Through experimentation with the W3C HTML validator, I've worked out that
iso-8859-1 will work for Notepad files with standard English text plus acute
accented vowels.


That's because Windows-1252 is compatible with ISO-8859-1 when that
subset is used.
Windows 95/98 Notepad files must be saved with an encoding attribute.


This is mysterious. What does it mean? That Notepad won't save
them without one? Or that you have to add one to make it work
in the web browser?


I can't make head or tail of it.


It actually means that version of Notepad will only save as
Windows-1252, so it needs to be declared in the XML declaration. That
is because an XML parser will assume UTF-8 without it and that
assumption is acceptable only when the US-ASCII subset is used.
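
A quick way to see this for yourself, sketched in Python with the standard library's `xml.etree.ElementTree` (the helper name `parses` is made up for the example). The test document contains "é", which is the single byte 0xE9 in both Windows-1252 and ISO-8859-1; the declaration here uses iso-8859-1, which is safe for that character.

```python
import xml.etree.ElementTree as ET

# "café" saved the way old Notepad would save it: é becomes byte 0xE9.
body = "<a>caf\u00e9</a>".encode("iso-8859-1")


def parses(document: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


undeclared = body
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

print(parses(undeclared))   # parser assumes UTF-8 and rejects byte 0xE9
print(parses(declared))     # declaration tells the parser how to read it
print(ET.fromstring(declared).text)
```

A document that sticks to the US-ASCII subset parses either way, which is why the missing declaration often goes unnoticed.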

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://SpreadFirefox.com/ Igniting the Web
Jul 20 '05 #15
On Wed, 2 Feb 2005, EU citizen wrote:
The need for the XML encoding statement to match the original file
format was not mentioned in any of the (many) articles I've read on
XML/XHTML over the last *four* years.


The XML coding has to comply with the relevant bit of the XML
specification. Whether you read it "over the last four years" or not.
http://www.w3.org/TR/REC-xml/#charencoding

Talking about the "original file format" could be misleading, bearing
in mind that some HTTP servers are set up to transcode the
internally-stored file format into one that's more appropriate for use
on the web. For XML-based markups, that may call for appropriate
rewriting of the document's XML encoding specification. And if you're
using XHTML/1.0 Appendix C then the transcoded document would need to
conform to its constraints too.

Jul 20 '05 #16
Alan J. Flavell wrote:
On Wed, 2 Feb 2005, EU citizen wrote:

The need for the XML encoding statement to match the original file
format was not mentioned in any of the (many) articles I've read on
XML/XHTML over the last *four* years.

The XML coding has to comply with the relevant bit of the XML
specification. Whether you read it "over the last four years" or not.
http://www.w3.org/TR/REC-xml/#charencoding

Talking about the "original file format" could be misleading, bearing
in mind that some HTTP servers are set up to transcode the
internally-stored file format into one that's more appropriate for use
on the web. For XML-based markups, that may call for appropriate
rewriting of the document's XML encoding specification. And if you're
using XHTML/1.0 Appendix C then the transcoded document would need to
conform to its constraints too.


RFC 3023 talks about XML media types.

I maintain that text/xml (and text/and-others-related-to-xml) should be
avoided in favour of application/xml (and
application/and-others-related-to-xml).

Here we get utf-8:
Content-type: text/xml; charset="utf-8"
<?xml version="1.0" encoding="utf-8"?>

!?!?! Here we get US-ASCII, despite the encoding specified:
Content-type: text/xml
<?xml version="1.0" encoding="utf-8"?>

Here we get utf-16:
Content-type: application/xml; charset="utf-16"
{BOM}<?xml version="1.0" encoding="utf-16"?>

Here we get the right encoding-known-by-your-parser:
Content-type: application/xml
<?xml version="1.0" encoding="encoding-known-by-your-parser"?>
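
The four cases above can be sketched as a small Python function. This is only an illustration of the rules as described in this post: the function name `effective_xml_encoding` and its `declared` parameter are made up here, and the fallback for application/xml ignores BOM detection for brevity.

```python
from email.message import Message
from typing import Optional


def effective_xml_encoding(content_type: str, declared: Optional[str] = None) -> str:
    """Pick the encoding a receiver should use (simplified sketch).

    content_type: value of the HTTP Content-Type header
    declared:     encoding named in the XML declaration, if any
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        # An explicit charset parameter wins over the XML declaration.
        return str(charset).lower()
    if msg.get_content_type().startswith("text/"):
        # text/* without charset: US-ASCII, whatever the document says.
        return "us-ascii"
    # application/* without charset: trust the document itself.
    return (declared or "utf-8").lower()


print(effective_xml_encoding('text/xml; charset="utf-8"', "utf-8"))
print(effective_xml_encoding("text/xml", "utf-8"))
print(effective_xml_encoding('application/xml; charset="utf-16"', "utf-16"))
print(effective_xml_encoding("application/xml", "iso-8859-1"))
```

The second call is the surprising one: the declaration says utf-8, but the text/xml media type without a charset parameter overrides it.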

--
Cordialement,

///
(. .)
-----ooO--(_)--Ooo-----
| Philippe Poulard |
-----------------------
Jul 20 '05 #17
In article <Pi************ *************** ***@ppepc56.ph. gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla .ac.uk> wrote:
The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.


That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe. If communication fails, because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.

This simplifies to a rule of thumb:
When producing XML, always use UTF-8 (and Unicode Normalization Form C).
Those who absolutely insist on using UTF-16 can use UTF-16 instead of
UTF-8.
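
For illustration, this rule of thumb might look like the following in Python. The helper name `to_xml_bytes` is hypothetical; `unicodedata.normalize` is the standard library call for Unicode normalization.

```python
import unicodedata


def to_xml_bytes(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
composed = "caf\u00e9"      # precomposed "é"

# Both spellings produce the same bytes after normalization.
print(to_xml_bytes(decomposed) == to_xml_bytes(composed))
```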

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #18
Henri Sivonen wrote:
In article <Pi************ *************** ***@ppepc56.ph. gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla .ac.uk> wrote:

The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.

That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe. If communication fails, because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.

This simplifies to a rule of thumb:
When producing XML, always use UTF-8 (and Unicode Normalization Form C).
Those who absolutely insist on using UTF-16 can use UTF-16 instead of
UTF-8.


This is theory.

Does anybody know of a parser that doesn't handle iso-8859-1
correctly? I don't think so; if yours doesn't, you should change it, and
communication becomes safe :)

--
Cordialement,

///
(. .)
-----ooO--(_)--Ooo-----
| Philippe Poulard |
-----------------------
Jul 20 '05 #19
On Fri, 4 Feb 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla .ac.uk> wrote:
The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.
That is not a safe conclusion.


I guess that was one of the penalties of responding to a cross-posted
article.
XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature.
But that's OK, since any plausible encoding produced by the editor can
be transformed by rote into utf-8 prior to subsequent XML processing
(that's the XML relevance). And pretty much any plausible encoding
produced by an editor that's meant for WWW use, is going to be
supported by the available web browsers (that's the c.i.w.a.h
relevance).
It follows that using any encoding other than UTF-8 or UTF-16 is
unsafe.


I take your point, but again: as long as the document is correctly
labelled, it can be transformed by rote into utf-8, it needs no
special heuristics, nor does it run risks of being damaged in the
process.
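
A minimal Python sketch of that kind of rote transformation, assuming the label is correct. The function name `transcode_to_utf8` is invented for the example, and it converts the bytes only; an XML document's encoding declaration would still need rewriting separately.

```python
def transcode_to_utf8(data: bytes, labelled_encoding: str) -> bytes:
    """Decode using the labelled encoding, re-encode as UTF-8 (sketch)."""
    return data.decode(labelled_encoding).encode("utf-8")


# Text with smart quotes and an accented letter, as Notepad would save it.
notepad_bytes = "\u201csmart\u201d caf\u00e9".encode("cp1252")
utf8_bytes = transcode_to_utf8(notepad_bytes, "windows-1252")

print(utf8_bytes.decode("utf-8"))
```

No guessing is involved: the label alone drives the conversion. With a wrong label the same call either raises an error or silently produces the wrong characters, which is why correct labelling matters.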

all the best

Jul 20 '05 #20
