Unicode BOM marks

Hi,

For the first time in my programmer life, I have to take care of character
encoding. I have a question about the BOM marks.

If I understand correctly, some systems (Windows?) add a BOM at the beginning
of a UTF-8 encoded file, and some (Linux?) don't. Therefore, the exact same
text encoded in the same UTF-8 will result in two different binary files of
slightly different lengths.
Right ?

I guess that this leading BOM consists of special marker bytes that cannot, in
any way, be decoded as valid text.
Right ?
(I really really hope the answer is yes otherwise we're in hell when moving
file from one platform to another, even with the same Unicode encoding).

I also guess that this leading BOM mark is silently ignored by any unicode
aware file stream reader to which we already indicated that the file follows
the UTF-8 encoding standard.
Right ?

If so, is it the case with the python codecs decoder ?

In the Python documentation, I see these constants. The documentation is not clear
about which encoding these constants apply to. Here's my understanding :

BOM : UTF-8 only or UTF-8 and UTF-32 ?
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_UTF8 : UTF-8 only
BOM_UTF16 : UTF-16 only
BOM_UTF16_BE : UTF-16 only
BOM_UTF16_LE : UTF-16 only
BOM_UTF32 : UTF-32 only
BOM_UTF32_BE : UTF-32 only
BOM_UTF32_LE : UTF-32 only

Why should I need these constants if codecs decoder can handle them without my
help, only specifying the encoding ?

Thank you

Francis Girard


Python tells me to use an encoding declaration at the top of my files (the
message is referring to http://www.python.org/peps/pep-0263.html).

I expected to see there a list of acceptable

Jul 18 '05 #1
Francis Girard wrote:
If I understand correctly, some systems (Windows?) add a BOM at the beginning
of a UTF-8 encoded file, and some (Linux?) don't. Therefore, the exact same
text encoded in the same UTF-8 will result in two different binary files of
slightly different lengths.
Right ?
Mostly correct. I would prefer if people referred to the thing not as
"BOM" but as "UTF-8 signature", at least in the context of UTF-8, as
UTF-8 has no byte-order issues that a "byte order mark" would deal with.
(it is correct to call it "BOM" in the context of UTF-16 or UTF-32).

Also, "some systems" is inadequate. It is not so much the operating
system that decides to add or leave out the UTF-8 signature, but much
more the application writing the file. Any high-quality tool will accept
the file with or without signature, whether it is a tool on Windows
or a tool on Unix.
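
To make the byte-level difference concrete, here is a minimal sketch in present-day Python 3 (which postdates this thread). The helper name `decode_utf8_any` is hypothetical and only illustrates a tool that accepts both variants; `codecs.BOM_UTF8` is the standard-library constant.

```python
import codecs

def decode_utf8_any(raw: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without the signature."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # drop the 3-byte signature
    return raw.decode("utf-8")

text = "café"
plain = text.encode("utf-8")        # no signature
signed = codecs.BOM_UTF8 + plain    # same text, with signature

# Same text, two different byte sequences, differing in length by the signature.
length_difference = len(signed) - len(plain)
```

Both byte strings decode to the same text through the helper, even though the files themselves differ.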

I personally would write my applications so that they put the signature
into files that cannot be concatenated meaningfully (since the
signature simplifies encoding auto-detection) and leave out the
signature from files which can be concatenated (as concatenating the
files will put the signature in the middle of a file).

I guess that this leading BOM consists of special marker bytes that cannot, in
any way, be decoded as valid text.
Right ?
Wrong. The BOM mark decodes as U+FEFF:
>>> import codecs
>>> codecs.BOM_UTF8.decode("utf-8")
u'\ufeff'

This is what makes it a byte order mark: in UTF-16, you can tell the
byte order by checking whether it is FEFF or FFFE. The character U+FFFE
is an invalid character, which cannot be decoded as valid text
(although the Python codec will decode it as invalid text).
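
The byte-order check can be sketched as follows, assuming Python 3 and the standard `codecs` constants. The sample text and variable names are made up for illustration.

```python
import codecs

text = "hi"
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" decoder reads the BOM, picks the byte order, and drops it.
decoded_big = big.decode("utf-16")
decoded_little = little.decode("utf-16")

# Reading the little-endian BOM with the wrong byte order yields U+FFFE.
swapped = codecs.BOM_UTF16_LE.decode("utf-16-be")
```

Seeing U+FFFE as the first character is therefore a strong hint that the bytes were read in the wrong order.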
I also guess that this leading BOM mark is silently ignored by any unicode
aware file stream reader to which we already indicated that the file follows
the UTF-8 encoding standard.
Right ?
No. It should eventually be ignored by the application, but whether the
stream reader special-cases it or not depends on application needs.
If so, is it the case with the python codecs decoder ?
No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
it to the application when it finds it, and it will never generate the
signature on its own. So processing the UTF-8 signature is left to the
application in Python.
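
A small sketch of what "left to the application" means in practice, written for Python 3, where the plain "utf-8" codec still behaves this way. The variable names are illustrative only.

```python
import codecs

signed = codecs.BOM_UTF8 + "abc".encode("utf-8")

decoded = signed.decode("utf-8")   # the signature comes back as U+FEFF
encoded = "abc".encode("utf-8")    # the codec never writes a signature

# Application-level handling: drop one leading U+FEFF if present.
cleaned = decoded[1:] if decoded.startswith("\ufeff") else decoded
```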
In the Python documentation, I see these constants. The documentation is not clear
about which encoding these constants apply to. Here's my understanding :

BOM : UTF-8 only or UTF-8 and UTF-32 ?
UTF-16.
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
UTF-16
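
This can be checked directly against the `codecs` module. The sketch below assumes Python 3; the names `aliases_match` and `native` are only for illustration.

```python
import codecs
import sys

# The unqualified names are aliases for the UTF-16 constants.
aliases_match = (
    codecs.BOM == codecs.BOM_UTF16
    and codecs.BOM_BE == codecs.BOM_UTF16_BE
    and codecs.BOM_LE == codecs.BOM_UTF16_LE
)

# BOM and BOM_UTF16 follow the byte order of the machine running Python.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
```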
Why should I need these constants if codecs decoder can handle them without my
help, only specifying the encoding ?


Well, because the codecs don't. It might be useful to add a
"utf-8-signature" codec some day, which generates the signature on
encoding, and removes it on decoding.
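
A codec along these lines was later added to the standard library under the name "utf-8-sig" (from Python 2.5 onward). A minimal sketch of its behaviour in Python 3:

```python
import codecs

encoded = "abc".encode("utf-8-sig")        # signature is prepended
with_sig = encoded.decode("utf-8-sig")     # signature is removed
without_sig = b"abc".decode("utf-8-sig")   # a missing signature is fine too
```

Decoding with "utf-8-sig" is thus a convenient way to accept both variants of a UTF-8 file.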

Regards,
Martin
Jul 18 '05 #2
On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:

Hi,

Thank you for your very informative answer. Some interspersed remarks follow.

I personally would write my applications so that they put the signature
into files that cannot be concatenated meaningfully (since the
signature simplifies encoding auto-detection) and leave out the
signature from files which can be concatenated (as concatenating the
files will put the signature in the middle of a file).

Well, there are no text files that can't be concatenated ! Sooner or later, someone
will use "cat" on the text files your application generated. That will be a lot of
fun for the new unicode aware "super-cat".
I guess that this leading BOM consists of special marker bytes that cannot, in
any way, be decoded as valid text.
Right ?


Wrong. The BOM mark decodes as U+FEFF:
>>> codecs.BOM_UTF8.decode("utf-8")
u'\ufeff'


I meant "valid text" to denote human readable actual real natural language
text. My intent with this question was to make sure that we can easily
distinguish a UTF-8 file with the signature from one without. Your answer implies
a "yes".
I also guess that this leading BOM mark is silently ignored by any
unicode aware file stream reader to which we already indicated that the
file follows the UTF-8 encoding standard.
Right ?


No. It should eventually be ignored by the application, but whether the
stream reader special-cases it or not depends on application needs.


Well, for most of us, I think, the need is to transparently decode the input
into a unique internal unicode encoding (UTF-16 for both java and Qt ; Qt
docs saying there might be a need to switch to UTF-32 some day) and then be
able to manipulate this internal text with the usual tools your programming
system provides. By "transparent", I mean, at least, to be able to
automatically process the two variants of the same UTF-8 encoding. We should
only have to specify "UTF-8" and the streamer takes care of the rest.

BTW, the python "unicode" built-in function documentation says it returns a
"unicode" string which scarcely means something. What is the python
"internal" unicode encoding ?

No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
it to the application when it finds it, and it will never generate the
signature on its own. So processing the UTF-8 signature is left to the
application in Python.
Ok.
In the Python documentation, I see these constants. The documentation is not
clear about which encoding these constants apply to. Here's my understanding :

BOM : UTF-8 only or UTF-8 and UTF-32 ?


UTF-16.
BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?


UTF-16

Ok.
Why should I need these constants if codecs decoder can handle them
without my help, only specifying the encoding ?


Well, because the codecs don't. It might be useful to add a
"utf-8-signature" codec some day, which generates the signature on
encoding, and removes it on decoding.

Ok.

My sincere thanks,

Francis Girard
Regards,
Martin


Jul 18 '05 #3
On Mon, Mar 07, 2005 at 11:56:57PM +0100, Francis Girard wrote:
BTW, the python "unicode" built-in function documentation says it returns a
"unicode" string which scarcely means something. What is the python
"internal" unicode encoding ?


The language reference says fairly little about unicode objects. Here's
what it does say: [http://docs.python.org/ref/types.html#l2h-48]
Unicode
The items of a Unicode object are Unicode code units. A Unicode
code unit is represented by a Unicode object of one item and can
hold either a 16-bit or 32-bit value representing a Unicode
ordinal (the maximum value for the ordinal is given in
sys.maxunicode, and depends on how Python is configured at
compile time). Surrogate pairs may be present in the Unicode
object, and will be reported as two separate items. The built-in
functions unichr() and ord() convert between code units and
nonnegative integers representing the Unicode ordinals as
defined in the Unicode Standard 3.0. Conversion from and to
other encodings are possible through the Unicode method encode
and the built-in function unicode().

In terms of the CPython implementation, the PyUnicodeObject is laid out
as follows:
typedef struct {
    PyObject_HEAD
    int length;          /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;     /* Raw Unicode buffer */
    long hash;           /* Hash value; -1 if not set */
    PyObject *defenc;    /* (Default) Encoded version as Python
                            string, or NULL; this is used for
                            implementing the buffer protocol */
} PyUnicodeObject;
Py_UNICODE is some "C" integral type that can hold values up to
sys.maxunicode (probably one of unsigned short, unsigned int, unsigned
long, wchar_t).

Jeff


Jul 18 '05 #4
Francis Girard wrote:
Well, there are no text files that can't be concatenated ! Sooner or later, someone
will use "cat" on the text files your application generated. That will be a lot of
fun for the new unicode aware "super-cat".
Well, no. For example, Python source code is not typically concatenated,
nor is source code in any other language. The same holds for XML files:
concatenating two XML documents (using cat) gives an ill-formed document
- whether the files start with a UTF-8 signature or not.

As for the "super-cat": there is actually no problem with putting U+FEFF
in the middle of some document - applications are supposed to filter it
out. The precise processing instructions in the Unicode standard vary
from Unicode version to Unicode version, but essentially, you are
supposed to ignore the BOM if you see it.
BTW, the python "unicode" built-in function documentation says it returns a
"unicode" string which scarcely means something. What is the python
"internal" unicode encoding ?


A Unicode string is a sequence of integers. The numbers are typically
represented as base-2, but the details depend on the C compiler.
It is specifically *not* UTF-16, big or little endian (i.e. a single
number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
depending on a compile-time choice (which can be determined by looking
at sys.maxunicode, which in turn can be either 65535 or 1114111).

The programming interface to the individual characters is formed by
the unichr and ord builtin functions, which expect and return integers
between 0 and sys.maxunicode.
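
A quick way to inspect these values, sketched for Python 3, where `unichr` has been folded into `chr`. The variable names are illustrative.

```python
import sys

# 65535 on old "narrow" builds, 1114111 on "wide" builds;
# Python 3.3 and later always report 1114111.
largest = sys.maxunicode

code_point = ord("\ufeff")   # character -> integer
character = chr(0xFEFF)      # integer -> character (unichr in Python 2)
```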

Regards,
Martin

Jul 18 '05 #5
Hi,
Well, no. For example, Python source code is not typically concatenated,
nor is source code in any other language.
We did it with C++ files in order to have only one compilation unit to
accelerate compilation over the network. Also, all the languages with some
"include" directive will have to take care of it. I guess a unicode aware C
preprocessor already does.
As for the "super-cat": there is actually no problem with putting U+FEFF
in the middle of some document - applications are supposed to filter it
out. The precise processing instructions in the Unicode standard vary
from Unicode version to Unicode version, but essentially, you are
supposed to ignore the BOM if you see it.
Ok. I'm re-assured.
A Unicode string is a sequence of integers. The numbers are typically
represented as base-2, but the details depend on the C compiler.
It is specifically *not* UTF-16, big or little endian (i.e. a single
number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
depending on a compile-time choice (which can be determined by looking
at sys.maxunicode, which in turn can be either 65535 or 1114111).

The programming interface to the individual characters is formed by
the unichr and ord builtin functions, which expect and return integers
between 0 and sys.maxunicode.


Ok. I guess that Python gives the flexibility of being configurable (when
compiling Python) to internally represent unicode strings as fixed 2 or 4
bytes per character (UCS).

Thank you
Francis Girard

Jul 18 '05 #6

"Martin v. Löwis" wrote:
Francis Girard wrote:
Well, there are no text files that can't be concatenated ! Sooner or later, someone
will use "cat" on the text files your application generated. That will be a
lot of fun for the new unicode aware "super-cat".
Well, no. For example, Python source code is not typically concatenated,
nor is source code in any other language. The same holds for XML files:
concatenating two XML documents (using cat) gives an ill-formed document
- whether the files start with a UTF-8 signature or not.


And if you're talking HTML and XML, the situation is even worse, since
the application absolutely needs to be aware of the signature. HTML might
have a <meta ... > directive close to the front to tell you what the
encoding is supposed to be, and then again, it might not. You should be
able to depend on the first character being a <, but you might not be
able to. FitNesse, for example, sends FIT a file that consists of the HTML
between the <body> and </body> tags, and nothing else. This situation makes
character set detection in PyFit, um, interesting. (Fortunately, I have
other ways of dealing with FitNesse, but it's still an issue for batch use.)
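
As a rough illustration of signature-based detection for such fragments, here is a sketch in Python 3. The names `sniff_encoding` and `decode_fragment` are hypothetical, the UTF-8 fallback is an assumption, and the sketch does not look at any <meta> directive. It is not how PyFit does it.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_MARKS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(raw: bytes, default: str = "utf-8"):
    """Return (encoding, mark_length) guessed from a leading BOM or signature."""
    for mark, encoding in _MARKS:
        if raw.startswith(mark):
            return encoding, len(mark)
    return default, 0  # assumption: fall back to UTF-8 when no mark is found

def decode_fragment(raw: bytes) -> str:
    """Decode a fragment, skipping whatever mark was detected."""
    encoding, skip = sniff_encoding(raw)
    return raw[skip:].decode(encoding)
```

One known ambiguity remains: UTF-16 LE text that starts with a BOM followed by U+0000 looks like a UTF-32 LE mark, so a real tool may need additional heuristics.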
As for the "super-cat": there is actually no problem with putting U+FEFF
in the middle of some document - applications are supposed to filter it
out. The precise processing instructions in the Unicode standard vary
from Unicode version to Unicode version, but essentially, you are
supposed to ignore the BOM if you see it.
It would be useful for "super-cat" to filter all but the first one, however.
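
Such a filter could look roughly like this for UTF-8 input. It is only a sketch in Python 3: `super_cat` is a hypothetical helper, not an existing tool, and it works on byte strings already read from the files.

```python
import codecs

def super_cat(chunks):
    """Concatenate UTF-8 byte strings, keeping at most the first signature."""
    out = []
    for index, raw in enumerate(chunks):
        if index > 0 and raw.startswith(codecs.BOM_UTF8):
            raw = raw[len(codecs.BOM_UTF8):]  # drop signatures after the first
        out.append(raw)
    return b"".join(out)

first = codecs.BOM_UTF8 + b"one\n"
second = codecs.BOM_UTF8 + b"two\n"
joined = super_cat([first, second])
```

Only a signature at the very start of each input is removed; a U+FEFF that already sits in the middle of one input is left untouched.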

John Roth

Regards,
Martin


Jul 18 '05 #7
Francis Girard wrote:
On Monday 7 March 2005 21:54, "Martin v. Löwis" wrote:

Hi,

Thank you for your very informative answer. Some interspersed remarks follow.

I personally would write my applications so that they put the signature
into files that cannot be concatenated meaningfully (since the
signature simplifies encoding auto-detection) and leave out the
signature from files which can be concatenated (as concatenating the
files will put the signature in the middle of a file).

Well, there are no text files that can't be concatenated ! Sooner or later, someone
will use "cat" on the text files your application generated. That will be a lot of
fun for the new unicode aware "super-cat".


It is my understanding that the BOM (U+feff) is actually the
Unicode character "Non-breaking zero-width space". I take
this to mean that the character can appear invisibly
anywhere in text, and its appearance as the first character
of a text is pretty harmless. Concatenating files will
leave invisible space characters in the middle of the text,
but presumably not in the middle of words, so no harm is
done there either.

I suspect that the fact that an explicitly invisible
character feff has an invalid character code fffe for its
byte-reversed counterpart is no accident, and that the
character was intended from inception to also serve as a
byte order indication.

Steve
Jul 18 '05 #8
Steve Horsley wrote:
It is my understanding that the BOM (U+feff) is actually the Unicode
character "Non-breaking zero-width space".


My understanding is that this used to be the case. According to

http://www.unicode.org/faq/utf_bom.html#38

the application should now specify specific processing, and
simply dropping it or reporting an error are both acceptable behaviour.
Applications that need the ZWNBSP behaviour (i.e. want to indicate that
there should be no break at this point) should use U+2060 (WORD JOINER).
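
The two characters can be looked up with the standard `unicodedata` module. The sketch below assumes Python 3; `modernize` is a hypothetical helper showing one possible policy, namely keeping a leading U+FEFF and replacing interior ones with WORD JOINER.

```python
import unicodedata

bom_name = unicodedata.name("\ufeff")
joiner_name = unicodedata.name("\u2060")

def modernize(text: str) -> str:
    """Replace interior U+FEFF with U+2060, leaving the first character alone."""
    head, rest = text[:1], text[1:]
    return head + rest.replace("\ufeff", "\u2060")
```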

Regards,
Martin
Jul 18 '05 #9
