UTF16, BOM, and Windows Line endings

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise a UTF8 BOM and remove it (but not necessarily decode).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode), is it considered
*invalid* to remove it? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.
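
A minimal sketch of the BOM check described above, written for current Python 3 rather than the Python 2 used in this thread. The helper name `decode_with_bom` and the `latin-1` fallback are assumptions made for illustration; only UTF8 and UTF16 BOMs are checked, as in the question.

```python
import codecs

# (BOM, codec) pairs. The "utf-16" and "utf-8-sig" codecs consume the BOM
# themselves while decoding, so it does not need to be stripped by hand.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def decode_with_bom(data, default="latin-1"):
    """Hypothetical helper: return (text, codec_name) for raw bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    # No BOM found: the fallback encoding is an assumption, not a detection.
    return data.decode(default), default

raw = b"\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00"
text, codec_name = decode_with_bom(raw)
```

The data is read in binary mode first, then decoded, which matches the approach in the post.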

How should I handle line-endings for UTF16? Is it possible that other
programs (on Windows) will have line endings as u'\r\n'? When saving
files for that platform should I make the line endings u'\r\n'? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #1
Fuzzyman:
> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

    Fuzzy
    End of lines

    >>> contents = open("C:\\fuzzy.txt", "rb").read()
    >>> contents
    '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'

The '\r\x00\n\x00' is a u'\r\n'.

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.

Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends but '\r\n' and u'\r\n' are
safest on Windows.
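
A rough sketch of saving text the way Notepad produced it above, in Python 3: convert the line ends on the decoded text, then encode. The helper name `to_windows_utf16` and the file name in the comment are illustrative assumptions.

```python
import codecs

def to_windows_utf16(text):
    """Hypothetical helper: UTF-16-LE bytes with a BOM and CRLF line ends."""
    # Collapse every line-ending style to '\n' first so '\r\n' is not doubled
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf = unified.replace("\n", "\r\n")
    # Write the BOM explicitly so the byte order does not depend on the machine
    return codecs.BOM_UTF16_LE + crlf.encode("utf-16-le")

data = to_windows_utf16(u"Fuzzy\nEnd of lines")

# Write in binary mode so the line ends are not translated a second time:
# with open("fuzzy.txt", "wb") as f:   # file name is illustrative
#     f.write(data)
```

Applied to the same two lines of text, this yields the byte string shown in the interpreter session above.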

Neil
Feb 6 '06 #2

Neil Hodgson wrote:
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends but '\r\n' and u'\r\n' are
> safest on Windows.

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.
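
To illustrate why the decode has to come first, here is a small Python 3 sketch using a shortened version of the Notepad bytes from the earlier post. It assumes the encoding is already known to be UTF16 (for example from a BOM check).

```python
raw = b"\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00"

# Splitting the raw bytes on b"\n" cuts a two-byte code unit in half:
# the second piece starts with the stray '\x00' that belonged to the '\n'.
broken = raw.split(b"\n")

# Decoding first gives text where the line ends are single characters again.
text = raw.decode("utf-16")
lines = text.splitlines()
```

Note that `str.splitlines()` in Python 3 also treats a few other characters (form feed, U+2028 and similar) as line boundaries, which may or may not be wanted.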

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Feb 6 '06 #3
Fuzzyman:
> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.

You can normalise line endings:

    >>> x = "a\r\nb\rc\nd\n\re"
    >>> y = x.replace("\r\n", "\n").replace("\r", "\n")
    >>> y
    'a\nb\nc\nd\n\ne'
    >>> print y
    a
    b
    c
    d

    e

The empty line is because "\n\r" is 2 line ends.
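
The same normalisation can be sketched as a single regular-expression pass (Python 3). This is an alternative to the chained `replace` calls, not a correction of them; the helper name `normalise_newlines` is made up for the example.

```python
import re

def normalise_newlines(text):
    """Hypothetical helper: map '\\r\\n' and lone '\\r' to '\\n' in one pass."""
    # '\r\n?' matches a CR with an optional LF, so CRLF counts as one line end
    return re.sub(r"\r\n?", "\n", text)

x = "a\r\nb\rc\nd\n\re"
y = normalise_newlines(x)
```

Like the chained version, this treats "\n\r" as two line ends.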

Neil
Feb 7 '06 #4

Neil Hodgson wrote:
> You can normalise line endings:
>
>     >>> y = x.replace("\r\n", "\n").replace("\r", "\n")

Thanks - that works, but replaces *all* instances of '\r' with '\n' -
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me what line ending was used.

Apparently files opened in universal mode - 'rU' - have a newlines
attribute. That makes it a bit easier. :-)
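
A hedged sketch of the `newlines` attribute in Python 3, where universal newlines is the default for text-mode `open()` and the 'rU' flag is no longer needed. It assumes the encoding is already known (for example from a BOM check); the helper name `read_with_newlines` and the temporary file are only there to make the example self-contained.

```python
import os
import tempfile

def read_with_newlines(path, encoding):
    """Hypothetical helper: return (text, line endings seen while reading)."""
    # newline=None (the default) translates '\r\n' and '\r' to '\n' on read
    # and records what was seen in the file object's 'newlines' attribute.
    with open(path, "r", encoding=encoding, newline=None) as f:
        text = f.read()
        return text, f.newlines

# Demo: write a small UTF16 file with Windows line ends, then read it back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(u"Fuzzy\r\nEnd of lines".encode("utf-16"))

text, seen = read_with_newlines(path, "utf-16")
os.remove(path)
```

If a file mixes line-ending styles, `newlines` is a tuple of the styles seen rather than a single string.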

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 7 '06 #5
