UTF16, BOM, and Windows Line endings

Fuzzyman

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise the UTF8 BOM and remove it (but not necessarily decode).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode) is it considered
*invalid* to remove it ? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.
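
A minimal sketch of the BOM check described above, assuming Python 3 and the BOM constants from the standard `codecs` module. The function name `decode_with_bom`, the `_BOMS` table, and the `latin-1` fallback are illustrative choices, not anything from this thread. Note that the generic "utf-16" codec consumes the BOM on decode, while "utf-16-le" and "utf-16-be" leave it in the text as U+FEFF, so this sketch strips it by hand. It also does not try to tell a UTF-16 LE BOM apart from a UTF-32 LE one, which begins with the same two bytes.

```python
import codecs

# BOMs are checked in this order; table and function names are illustrative.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_with_bom(data, default="latin-1"):
    """Return (text, encoding, had_bom) for bytes read in binary mode."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM by hand, then decode with the endian-specific codec.
            return data[len(bom):].decode(encoding), encoding, True
    # No BOM found: fall back to an assumed default encoding.
    return data.decode(default), default, False


sample = codecs.BOM_UTF16_LE + "Fuzzy\r\nEnd of lines".encode("utf-16-le")
text, encoding, had_bom = decode_with_bom(sample)
```

Keeping `had_bom` around makes it possible to write the BOM back out when saving, so the file round-trips unchanged.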

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #1


Neil Hodgson

Fuzzyman:

> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

>>> contents = open("C:\\fuzzy.txt", "rb").read()
>>> contents
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'

The '\r\x00\n\x00' is a u'\r\n'.

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.

Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends but '\r\n' and u'\r\n' are
safest on Windows.
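
As a rough sketch of the saving side, assuming Python 3: the helper below (the name `encode_for_windows` is made up for illustration) first normalises whatever line endings are present, then converts them to '\r\n' and encodes as little-endian UTF-16 with a BOM, which is the byte layout shown in the Notepad dump above. The result would be written with a file opened in "wb" mode so that no further newline translation happens.

```python
import codecs


def encode_for_windows(text):
    """Encode text as UTF-16 LE with a BOM and CRLF line endings."""
    # Normalise existing endings to "\n" first so "\r\n" is not doubled.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf = normalised.replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + crlf.encode("utf-16-le")


payload = encode_for_windows("Fuzzy\nEnd of lines")
# To save: open(path, "wb").write(payload), where path is your own file name.
```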

Neil

Feb 6 '06 #2

Fuzzyman

Neil Hodgson wrote:

> Fuzzyman:
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?
>
> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> >>> contents = open("C:\\fuzzy.txt", "rb").read()
> >>> contents
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>
> The '\r\x00\n\x00' is a u'\r\n'.
>
> > When saving
> > files for that platform should I make the line endings u'\r\n' ? (This
> > sequence obviously encodes to four bytes in UTF16). I would only do
> > this to ensure compatibility with other programs the user may use to
> > create the text files.
>
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends but '\r\n' and u'\r\n' are
> safest on Windows.

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #3

Neil Hodgson

Fuzzyman:

> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.

You can normalise line endings:

>>> x = "a\r\nb\rc\nd\n\re"
>>> y = x.replace("\r\n", "\n").replace("\r","\n")
>>> y
'a\nb\nc\nd\n\ne'
>>> print y
a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.
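
If it also matters which line ending a file used, one possible approach is to tally the endings in the decoded text before normalising. This is only a sketch, assuming Python 3; `count_line_endings` is a made-up name and not the guessing function mentioned earlier in the thread.

```python
import re
from collections import Counter


def count_line_endings(text):
    """Count each line-ending style in already-decoded text."""
    # "\r\n" comes first in the pattern so a CRLF pair is counted once,
    # not as a lone "\r" followed by a lone "\n".
    return Counter(re.findall(r"\r\n|\r|\n", text))


counts = count_line_endings("a\r\nb\rc\nd\n\re")
```

Calling `counts.most_common(1)` then gives the dominant style, which could be reused when writing the file back out.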

Neil

Feb 7 '06 #4

Fuzzyman

Neil Hodgson wrote:

> Fuzzyman:
> > Thanks - so I need to decode to unicode and *then* split on line
> > endings. Problem is, that means I can't use Python to handle line
> > endings where I don't know the encoding in advance.
> >
> > In another thread I've posted a small function that *guesses* line
> > endings in use.
>
> You can normalise line endings:
>
> >>> x = "a\r\nb\rc\nd\n\re"
> >>> y = x.replace("\r\n", "\n").replace("\r","\n")
> >>> y
> 'a\nb\nc\nd\n\ne'
> >>> print y
> a
> b
> c
> d
>
> e
>
> The empty line is because "\n\r" is 2 line ends.

Thanks - that works, but it replaces *all* instances of '\r' with '\n' -
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me what line ending was used.

Apparently files opened in universal mode - 'rU' - have a newline
attribute. That makes it a bit easier. :-)
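
For what it's worth, here is a small sketch of reading that attribute, assuming Python 3, where universal newlines is the default behaviour of text mode and the file object exposes `newlines`. The helper name `detect_newlines` and the temporary file are illustrative only, and the encoding still has to be known (or guessed from the BOM) before the file can be opened this way.

```python
import os
import tempfile


def detect_newlines(path, encoding):
    """Return the line endings seen while reading in universal-newlines mode."""
    # newline=None translates "\r\n" and "\r" to "\n" on read and records
    # what was actually found in the file object's `newlines` attribute.
    with open(path, "r", encoding=encoding, newline=None) as handle:
        handle.read()
        return handle.newlines


# Demo: write a small UTF-16 file with CRLF endings, then inspect it.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "fuzzy.txt")
    with open(path, "wb") as handle:
        handle.write("Fuzzy\r\nEnd of lines\r\n".encode("utf-16"))
    seen = detect_newlines(path, "utf-16")
```

The attribute is None if no line ending was read, a single string if only one style was seen, and a tuple of strings if the file mixes styles.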

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Feb 7 '06 #5
