Detecting line endings

Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane :

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n' # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 6 '06 #1
Fuzzyman enlightened us with:
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?


I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r', and '\r' without a following '\n', and let the majority decide.

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Feb 6 '06 #2

Sybren Stuvel wrote:
Fuzzyman enlightened us with:
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?
I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r', and '\r' without a following '\n', and let the majority decide.


Sounds reasonable, edge cases for small files be damned. :-)

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Feb 6 '06 #3

Sybren Stuvel wrote:
Fuzzyman enlightened us with:
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?
I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r', and '\r' without a following '\n', and let the majority decide.


This is what I came up with. As you can see from the docstring, it
attempts to sensible(-ish) things in the event of a tie, or no line
endings at all.

Comments/corrections welcomed. I know the tests aren't very useful
(because they make no *assertions* they won't tell you if it breaks),
but you can see what's going on :

import re
import os

rn = re.compile('\r\n')
r = re.compile('\r(?!\n)')
n = re.compile('(?<!\r)\n')

# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]

def find_ending(text, default=os.linesep):
    """
    Given a piece of text, use a simple heuristic to determine the line
    ending in use.

    Returns the value assigned to default if no line endings are found.
    This defaults to ``os.linesep``, the native line ending for the
    machine.

    If there is a tie between two endings, the priority chain is
    ``'\n', '\r\n', '\r'``.
    """
    results = [(len(exp.findall(text)), priority, literal) for
               exp, literal, priority in line_ending]
    results.sort()
    print results
    if not sum([m[0] for m in results]):
        return default
    else:
        return results[-1][-1]

if __name__ == '__main__':
    tests = [
        'hello\ngoodbye\nmy fish\n',
        'hello\r\ngoodbye\r\nmy fish\r\n',
        'hello\rgoodbye\rmy fish\r',
        'hello\rgoodbye\n',
        '',
        '\r\r\r \n\n',
        '\n\n \r\n\r\n',
        '\n\n\r \r\r\n',
        '\n\r \n\r \n\r',
    ]
    for entry in tests:
        print repr(entry)
        print repr(find_ending(entry))
        print

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Feb 6 '06 #4
Fuzzyman <fu******@gmail.com> wrote:
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
Open the file with 'rU' mode, and check the file object's newline
attribute.
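
Untested sketch (note the attribute is spelled ``newlines``, and it only
gets filled in once some data has actually been read):

f = open('test.txt', 'rU')
f.read()
print(repr(f.newlines))   # '\r\n', '\r', '\n', a tuple of those seen, or None
f.close()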
My worry is that if '\n' *doesn't* signify a line break on the Mac,


It does, and has for a few years now, since MacOSX is a version of Unix
to all practical intents and purposes.
Alex
Feb 7 '06 #5
Fuzzyman enlightened us with:
This is what I came up with. [...] Comments/corrections welcomed.


You could use a little more comments in the code, but apart from that
it looks nice.

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Feb 7 '06 #6

Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
Open the file with 'rU' mode, and check the file object's newline
attribute.


Ha, so long as it works with Python 2.2, that makes things a bit
easier.

Rats, I liked that snippet of code (I'm a great fan of list
comprehensions). :-)
My worry is that if '\n' *doesn't* signify a line break on the Mac,


It does, and has for a few years now, since MacOSX is a version of Unix
to all practical intents and purposes.


I wondered if that might be the case. I think I've worried about this
more than enough now.

Thanks

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Alex


Feb 7 '06 #7

Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using


Open the file with 'rU' mode, and check the file object's newline
attribute.


Do you know if this works for multi-byte encodings ? Do files have
metadata associated with them showing the line-ending in use ?

I suppose I could test this...
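
Something like this would test it, I suppose (throwaway sketch, assuming
Python 2 semantics and a made-up scratch file name):

data = u'one\r\ntwo\r\n'.encode('utf-16')
out = open('utf16_test.txt', 'wb')
out.write(data)
out.close()

f = open('utf16_test.txt', 'rU')
f.read()
print(repr(f.newlines))   # based on the raw bytes seen, not the decoded text
f.close()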

All the best,
Fuzzy
My worry is that if '\n' *doesn't* signify a line break on the Mac,


It does, and has for a few years now, since MacOSX is a version of Unix
to all practical intents and purposes.
Alex


Feb 7 '06 #8
Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:

Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using

Open the file with 'rU' mode, and check the file object's newline
attribute.


Do you think it would be sensible to have file.readline use universal
newline support by default?

I just got flummoxed by this issue, working with a (pre-alpha) package
by very experienced Python programmers who sent file.readline to
tokenizer.py without universal newline support. I went on a long (and
educational) journey trying to figure out why my file was not being
processed as expected.

Are there circumstances in which it would be sensible to have tokenizer
process files without universal newline support?

The result here was having tokenizer detect indentation inconsistencies
that did not exist - in the sense that the files compiled and ran fine
under python.exe.
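
For concreteness, the fix I have in mind is roughly this (untested,
assuming the stdlib ``tokenize`` module is what's involved, and with a
placeholder filename):

import tokenize

# The point is that the readline handed to tokenize comes from a file
# opened in universal-newline mode, so CRLF and CR-only files tokenize
# the same way as LF files.
f = open('some_module.py', 'rU')
for tok in tokenize.generate_tokens(f.readline):
    print(tok)
f.close()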

Art
Feb 7 '06 #9
Arthur wrote:
Alex Martelli wrote:

I just got flummoxed by this issue, working with a (pre-alpha) package
by very experienced Python programmers who sent file.readline to
tokenizer.py without universal newline support. Went on a long (and
educational) journey trying to figure out why my file was not being
processed as expected.


For example, the widely used MoinMoin source code colorizer sends files
to tokenizer without universal newline support:

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52298

Is my premise correct that tokenizer needs universal newline support to
be reliable?

What else could put it out of sync with the compiler?

Art
Feb 7 '06 #10
On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" <fu******@gmail.com> wrote:
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane :

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n' # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?

Are you guaranteed that text bodies don't contain escape or quoting
mechanisms for binary data where it would be a mistake to convert
or delete an '\r' ? (E.g., I think XML CDATA might be an example).

Regards,
Bengt Richter
Feb 7 '06 #11
Fuzzyman <fu******@gmail.com> wrote:
...
Open the file with 'rU' mode, and check the file object's newline
attribute.
Do you know if this works for multi-byte encodings ? Do files have


You mean when you open them with the codecs module?
metadata associated with them showing the line-ending in use ?


Not in the filesystems I'm familiar with (they did use to, in
filesystems used on VMS and other ancient OSs, but that was a very long
time ago).
Alex
Feb 7 '06 #12

Bengt Richter wrote:
On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" <fu******@gmail.com> wrote:
Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane :

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n' # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'
My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?
Are you guaranteed that text bodies don't contain escape or quoting
mechanisms for binary data where it would be a mistake to convert
or delete an '\r' ? (E.g., I think XML CDATA might be an example).


My personal use case is for reading config files in arbitrary encodings
(so it's not an issue).

How would Python handle opening such files when not in binary mode?
That may be an issue even on Linux - if you open a Windows file and
use splitlines, does Python convert '\r\n' to '\n'? (Or does it leave
the extra '\r's in place, which is *different* to the behaviour under
Windows?)
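
(For the splitlines part at least, a quick interpreter check with a
made-up string:)

sample = 'one\r\ntwo\nthree\rfour'
print(sample.splitlines())       # ['one', 'two', 'three', 'four']
print(sample.splitlines(True))   # ['one\r\n', 'two\n', 'three\r', 'four']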

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Regards,
Bengt Richter


Feb 7 '06 #13

Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
...
Open the file with 'rU' mode, and check the file object's newline
attribute.


Do you know if this works for multi-byte encodings ? Do files have


You mean when you open them with the codecs module?


No - if I open a UTF16 encoded file in universal mode, will it still
have the correct line-ending attribute?

I can't open with a codec unless an encoding is explicitly supplied. I
still want to detect UTF16 even if the encoding isn't specified.

As I said, I ought to test this... Without metadata, I wonder how Python
determines it?
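
One rough way I could detect UTF16 myself without being told the encoding
is to sniff for a byte-order mark before deciding how to decode (a sketch
only - it obviously misses BOM-less UTF16 files):

import codecs

data = open('test.txt', 'rb').read()
if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
    text = data.decode('utf-16')     # the 'utf-16' codec consumes the BOM
else:
    text = data                      # no BOM - fall back to treating it as bytes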

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
metadata associated with them showing the line-ending in use ?


Not in the filesystems I'm familiar with (they did use to, in
filesystems used on VMS and other ancient OSs, but that was a very long
time ago).
Alex


Feb 7 '06 #14

Arthur wrote:
Is my premise correct that tokenizer needs universal newline support to
be reliable?

What else could put it out of sync with the compiler?


Anybody out there?

Is my question, and the real-world issue that provoked it, unclear?

Is the answer too obvious?

Have I made *everybody's* kill list?

Isn't it a prima facie issue if the tokenizer fails in ways
incompatible with what the compiler is seeing?

Is this just easy, and I am making it hard? As I apparently do with
Python more generally.

Art

Feb 7 '06 #15
Fuzzyman <fu******@gmail.com> wrote:
...
I can't open with a codec unless an encoding is explicitly supplied. I
still want to detect UTF16 even if the encoding isn't specified.

As I said, I ought to test this... Without metadata I wonder how Python
determines it ?


It doesn't. Python doesn't even try to guess: nor would any other
sensible programming language.
Alex
Feb 8 '06 #16

Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
...
I can't open with a codec unless an encoding is explicitly supplied. I
still want to detect UTF16 even if the encoding isn't specified.

As I said, I ought to test this... Without metadata I wonder how Python
determines it ?
It doesn't. Python doesn't even try to guess: nor would any other
sensible programming language.


Right, so opening in "rU" mode and testing the 'newline' attribute
*won't* work for UTF16 encoded files. (Which was what I was asking.)

I'll have to read, determine encoding, decode, then *either* use my
code to determine line endings *or* use ``splitlines(True)``.
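
The ``splitlines(True)`` route would look roughly like this (untested
sketch - just look at the ending of the first line that actually has one):

def ending_from_splitlines(text, default='\n'):
    # splitlines(True) keeps the line-ending characters on each line,
    # so the first terminated line tells us which convention is in use.
    for line in text.splitlines(True):
        if line.endswith('\r\n'):
            return '\r\n'
        elif line.endswith('\r'):
            return '\r'
        elif line.endswith('\n'):
            return '\n'
    return default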

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Alex


Feb 8 '06 #17

Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
...
Open the file with 'rU' mode, and check the file object's newline
attribute.


Just to confirm, for a UTF16 encoded file, the newlines attribute is
``None``.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Feb 8 '06 #18

Fuzzyman wrote:
Alex Martelli wrote:
Fuzzyman <fu******@gmail.com> wrote:
...
> Open the file with 'rU' mode, and check the file object's newline
> attribute.

Just to confirm, for a UTF16 encoded file, the newlines attribute is
``None``.


Hmmm... having read the documentation, the newlines attribute remains
None until some newlines are encountered. :oops:
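
That is, something like this only reports an ending after a read:

f = open('test.txt', 'rU')
print(repr(f.newlines))   # None - nothing has been read yet
f.readline()
print(repr(f.newlines))   # now reports whatever ending(s) have been seen
f.close()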

I don't think its technique is any better than mine though. ;-)

Fuzzy
http://www.voidspace.org.uk/python/index.shtml

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Feb 8 '06 #19
