UTF16, BOM, and Windows Line endings

Fuzzyman

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise a UTF8 BOM and remove it (but not necessarily decode).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode) is it considered
*invalid* to remove it ? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.
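
A minimal sketch of the BOM check described above, written for Python 3 (where bytes and text are separate types) rather than the Python 2 of this thread. The helper name decode_with_bom and the 'latin-1' fallback are assumptions made for illustration, and UTF-32 is deliberately ignored even though its little-endian BOM begins with the same two bytes as the UTF-16 one. The 'utf-16' and 'utf-8-sig' codecs consume the BOM themselves, so the decoded text does not carry it:

```python
import codecs

# Assumption: only UTF-8 and UTF-16 input is expected (UTF-32 is ignored).
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),   # codec strips the UTF-8 BOM
    (codecs.BOM_UTF16_LE, "utf-16"),  # codec reads the BOM to pick byte order
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def decode_with_bom(raw, default="latin-1"):
    """Return (text, encoding) for bytes that were read in binary mode."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding), encoding
    # No BOM found: fall back to a hypothetical default that never fails.
    return raw.decode(default), default
```

Decoding with an explicit 'utf-16-le' or 'utf-16-be' codec instead leaves the BOM in the text as u'\ufeff', so it would then have to be stripped by hand.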

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #1

Neil Hodgson

Fuzzyman:

> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

>>> contents = open("C:\\fuzzy.txt", "rb").read()
>>> contents
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'

The '\r\x00\n\x00' is a u'\r\n'.

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.

Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends but '\r\n' and u'\r\n' are
safest on Windows.
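
One way to act on this advice when writing in binary mode, sketched for Python 3. The function name encode_utf16_crlf is hypothetical, and little-endian UTF-16 with a BOM is assumed because that is what the Notepad bytes above show. Existing line endings are first collapsed to '\n' so that a '\r\n' already in the text does not become '\r\r\n':

```python
import codecs
import re

def encode_utf16_crlf(text):
    """Encode text as little-endian UTF-16 with a BOM and CRLF line ends."""
    # Collapse every line-ending style to '\n' before expanding to '\r\n'.
    unix = re.sub(r"\r\n|\r", "\n", text)
    windows = unix.replace("\n", "\r\n")
    # Write the BOM explicitly so the byte order does not depend on the host.
    return codecs.BOM_UTF16_LE + windows.encode("utf-16-le")
```

For the two-line text above this returns the same bytes that Notepad produced; they can then be written with open(path, "wb").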

Neil

Feb 6 '06 #2

Fuzzyman

Neil Hodgson wrote:

> Fuzzyman:
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?
>
> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> >>> contents = open("C:\\fuzzy.txt", "rb").read()
> >>> contents
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>
> The '\r\x00\n\x00' is a u'\r\n'.
>
> > When saving
> > files for that platform should I make the line endings u'\r\n' ? (This
> > sequence obviously encodes to four bytes in UTF16). I would only do
> > this to ensure compatibility with other programs the user may use to
> > create the text files.
>
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends but '\r\n' and u'\r\n' are
> safest on Windows.

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.
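
A short illustration of why the decode has to come first, using the bytes from Neil's Notepad example and assuming Python 3. In UTF-16 the '\r' and '\n' are separated by NUL bytes, so splitting the raw bytes on '\r\n' finds nothing. The helper name split_decoded_lines is an assumption for illustration:

```python
import re

# '\r\n' is listed first so a Windows line end is not split twice.
LINE_END = re.compile(r"\r\n|\r|\n")

def split_decoded_lines(raw, encoding):
    """Decode bytes first, then split the text on any common line ending."""
    return LINE_END.split(raw.decode(encoding))

# The UTF-16 bytes Notepad wrote for "Fuzzy" / "End of lines".
raw = (b"\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00"
       b"E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00")

print(raw.split(b"\r\n") == [raw])         # True: no byte-level match
print(split_decoded_lines(raw, "utf-16"))  # ['Fuzzy', 'End of lines']
```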

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #3

Neil Hodgson

Fuzzyman:

> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.

You can normalise line endings:

>>> x = "a\r\nb\rc\nd\n\re"
>>> y = x.replace("\r\n", "\n").replace("\r", "\n")
>>> y
'a\nb\nc\nd\n\ne'
>>> print y
a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.

Neil

Feb 7 '06 #4

Fuzzyman

Neil Hodgson wrote:

> Fuzzyman:
> > Thanks - so I need to decode to unicode and *then* split on line
> > endings. Problem is, that means I can't use Python to handle line
> > endings where I don't know the encoding in advance.
> >
> > In another thread I've posted a small function that *guesses* line
> > endings in use.
>
> You can normalise line endings:
>
> >>> x = "a\r\nb\rc\nd\n\re"
> >>> y = x.replace("\r\n", "\n").replace("\r", "\n")
> >>> y
> 'a\nb\nc\nd\n\ne'
> >>> print y
> a
> b
> c
> d
>
> e
>
> The empty line is because "\n\r" is 2 line ends.

Thanks - that works, but replaces *all* instances of '\r' to '\n' -
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me what line ending was used.

Apparently files opened in universal mode - 'rU' - have a newlines
attribute. That makes it a bit easier. :-)
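
A hedged sketch of the same idea in Python 3, where recent versions no longer accept the 'rU' mode and universal newlines are requested with newline=None. It wraps bytes that were read in binary mode instead of reopening the file; the function name read_and_report_newlines is hypothetical. The attribute that records the line endings seen is called newlines:

```python
import io

def read_and_report_newlines(raw, encoding):
    """Decode bytes with universal newlines; return (text, newlines seen)."""
    # newline=None turns on universal-newline translation, so '\r\n' and
    # '\r' both come back as '\n' and the wrapper records what it saw.
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None)
    text = wrapper.read()
    return text, wrapper.newlines
```

The second value is None if no line ending was read, a single string if only one style appeared, and a tuple of strings if the data mixed styles.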

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 7 '06 #5
