UTF16, BOM, and Windows Line endings

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise a UTF8 BOM and remove it (without necessarily decoding).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode), is it considered
*invalid* to remove it? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #1
Fuzzyman:
> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

>>> contents = open("C:\\fuzzy.txt", "rb").read()
>>> contents
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'


The '\r\x00\n\x00' is a u'\r\n'.
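
The leading '\xff\xfe' in that byte string is the little-endian UTF-16 BOM. As a rough sketch of sniffing the BOM and decoding accordingly (written for Python 3; decode_with_bom and the latin-1 fallback are hypothetical choices for illustration, and UTF-32 BOMs are deliberately not handled):

```python
import codecs

# (BOM bytes, codec name) pairs, using the constants from the codecs module.
# UTF-32 is left out: its little-endian BOM also starts with '\xff\xfe'.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data, default="latin-1"):
    """Return (text, encoding), stripping a leading BOM if one is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    # No BOM: the fallback encoding is an assumption, not a detection.
    return data.decode(default), default

raw = b"\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00"
text, encoding = decode_with_bom(raw)
print(repr(text), encoding)
```

Once decoded, the four bytes '\r\x00\n\x00' are just the two characters '\r\n', so a regex or split on the decoded text no longer cuts through the middle of a character.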

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.


Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends but '\r\n' and u'\r\n' are
safest on Windows.
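
For the saving side, something along these lines should produce a file Notepad is happy with. It is only a sketch in Python 3; encode_for_windows is a made-up name, and choosing little-endian UTF-16 with a BOM is an assumption based on the Notepad output shown earlier:

```python
import codecs
import os
import tempfile

def encode_for_windows(text):
    """Encode text as UTF-16 LE with a BOM and CRLF line endings."""
    # Normalise to LF first so existing CRLF pairs are not doubled.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf = unified.replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + crlf.encode("utf-16-le")

data = encode_for_windows("Fuzzy\nEnd of lines")

# Binary mode, so Python does no newline translation of its own.
path = os.path.join(tempfile.mkdtemp(), "fuzzy.txt")
with open(path, "wb") as handle:
    handle.write(data)
```

The line endings are converted on the decoded text and only then encoded, which is why each '\r\n' comes out as four bytes.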

Neil
Feb 6 '06 #2

Neil Hodgson wrote:
> Fuzzyman:
> > How should I handle line-endings for UTF16 ? Is it possible that other
> > programs (on windows) will have line endings as u'\r\n' ?
>
> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> >>> contents = open("C:\\fuzzy.txt", "rb").read()
> >>> contents
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>
> The '\r\x00\n\x00' is a u'\r\n'.
>
> > When saving
> > files for that platform should I make the line endings u'\r\n' ? (This
> > sequence obviously encodes to four bytes in UTF16). I would only do
> > this to ensure compatibility with other programs the user may use to
> > create the text files.
>
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends but '\r\n' and u'\r\n' are
> safest on Windows.


Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Feb 6 '06 #3
Fuzzyman:
> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.


You can normalise line endings:

>>> x = "a\r\nb\rc\nd\n\re"
>>> y = x.replace("\r\n", "\n").replace("\r","\n")
>>> y
'a\nb\nc\nd\n\ne'
>>> print y
a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.
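
The same normalisation can be done in a single pass with a regular expression, which also makes it easy to convert to '\r\n' instead. This is just a sketch in Python 3 on already-decoded text; LINE_END and normalise are illustrative names:

```python
import re

# Matches one line ending; CRLF is listed first so it counts as one ending.
LINE_END = re.compile(r"\r\n|\r|\n")

def normalise(text, newline="\n"):
    """Replace every CRLF, CR or LF in decoded text with `newline`."""
    return LINE_END.sub(newline, text)

x = "a\r\nb\rc\nd\n\re"
print(repr(normalise(x)))
print(repr(normalise(x, "\r\n")))
```

With the default argument this gives the same result as the chained replace calls, including the empty line produced by "\n\r".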

Neil
Feb 7 '06 #4

Neil Hodgson wrote:
> Fuzzyman:
> > Thanks - so I need to decode to unicode and *then* split on line
> > endings. Problem is, that means I can't use Python to handle line
> > endings where I don't know the encoding in advance.
> >
> > In another thread I've posted a small function that *guesses* line
> > endings in use.
>
> You can normalise line endings:
>
> >>> x = "a\r\nb\rc\nd\n\re"
> >>> y = x.replace("\r\n", "\n").replace("\r","\n")
> >>> y
> 'a\nb\nc\nd\n\ne'
> >>> print y
> a
> b
> c
> d
>
> e
>
> The empty line is because "\n\r" is 2 line ends.

Thanks - that works, but it replaces *all* instances of '\r' with '\n',
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me which line ending was used.

Apparently files opened in universal mode - 'rU' - have a newlines
attribute. That makes it a bit easier. :-)
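
For what it's worth, here is roughly how both ideas look in Python 3, where universal newlines are requested with newline=None rather than 'rU'. It is a sketch only: count_line_endings is a made-up helper, and the in-memory bytes stand in for a real UTF-16 file:

```python
import io
import re
from collections import Counter

def count_line_endings(text):
    """Count each kind of line ending in already-decoded text."""
    return Counter(re.findall(r"\r\n|\r|\n", text))

# In-memory stand-in for a UTF-16 file saved by Notepad.
raw = "Fuzzy\r\nEnd of lines\r\n".encode("utf-16")

counts = count_line_endings(raw.decode("utf-16"))
print(counts)

# newline=None turns on universal newlines: line endings are translated
# to LF on reading, and `newlines` records which ending(s) were seen.
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16", newline=None)
content = reader.read()
print(repr(content), repr(reader.newlines))
```

Counting on the decoded text reports which endings occur without changing anything, while the newlines attribute only becomes meaningful once some of the file has been read.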

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 7 '06 #5
