
Does Python mess with CRLFs?

Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
import re

f = open("content.html", "r")
content = f.read()
f.close()

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
    | re.MULTILINE | re.DOTALL)

#GOOD
friends = re.compile('</td></tr></table>',re.IGNORECASE | re.MULTILINE
    | re.DOTALL)

m = friends.search(content)
if m:
    print("Found")
else:
    print("List not found")
==========

Thank you for any tip.
Nov 12 '08 #1
On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
wrote:
>I wonder if Python rewrites CRLFs when reading a text file with
>open/read?

For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.
Nov 12 '08 #2
On Nov 12, 10:04 pm, Gilles Ganault <nos...@nospam.com> wrote:
>Hello
>
>I'm stuck at understanding why Python can't extract some bit from an
>HTML file using regexes, although I can find it just fine with
>UltraEdit.
>
>I wonder if Python rewrites CRLFs when reading a text file with
>open/read?

Don't wonder; do some very elementary debugging and find out for
yourself.

>Here's the code:
>==========
>f = open("content.html", "r")
>content = f.read()
>f.close()

Consider inserting
print(repr(content))
here.
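
A minimal sketch of why repr() is the right tool here, assuming Python 3 (the thread itself dates from the Python 2 era). The two sample strings are made up for illustration and are not the contents of content.html; they differ only in their line endings, which a plain print would hide.

```python
# Two hypothetical snippets that look the same when printed normally
# but differ in their line endings.
with_crlf = "</td></tr></table>\r\n</div>\r\n"
with_lf = "</td></tr></table>\n</div>\n"

# repr() escapes control characters, so "\r" and "\n" become visible.
for label, content in (("CRLF", with_crlf), ("LF", with_lf)):
    print(label, repr(content))

# A quick count answers the question "does my data contain CRs at all?"
cr_counts = {"CRLF": with_crlf.count("\r"), "LF": with_lf.count("\r")}
print(cr_counts)
```

If the repr of the real file content shows only \n where the pattern expects \r\n, the mismatch is in the data that reached the regex, not in the regex engine.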

Nov 12 '08 #3

Gilles Ganault wrote:
>On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
>wrote:
>>I wonder if Python rewrites CRLFs when reading a text file with
>>open/read?
>
>For those seeing the same thing, the answer is yes: On Windows, the
>code above turns CRLF into LF. I tried "rb" instead of "r", with no
>difference.

Sorry, but that is not what's happening. Your problem is not in reading the
file; it's in the regular expression you're using.

Opening the file with mode "rb" leaves the content intact and does not munge
newlines in any way. A read() will return the exact bytes that are in the file.
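
The difference between the modes can be checked with a small sketch like the one below. It assumes Python 3, where text mode translates "\r\n" to "\n" on every platform by default (in the Python 2 of this thread, plain "r" did so only on Windows). The helper name read_modes and the sample bytes are hypothetical.

```python
import os
import tempfile


def read_modes(raw):
    """Write raw bytes to a temp file and read them back in three modes."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(raw)
        path = tmp.name
    try:
        # Binary mode: the exact bytes on disk, no newline handling.
        with open(path, "rb") as f:
            as_bytes = f.read()
        # Default text mode: "\r\n" is translated to "\n" while reading.
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
        # Text mode with newline="": decoding only, no newline translation.
        with open(path, "r", encoding="ascii", newline="") as f:
            as_untranslated = f.read()
    finally:
        os.remove(path)
    return as_bytes, as_text, as_untranslated


raw = b"</td></tr></table>\r\n</div>\r\n"
as_bytes, as_text, as_untranslated = read_modes(raw)
for value in (as_bytes, as_text, as_untranslated):
    print(repr(value))
```

Under these assumptions, a pattern containing \r\n can only match the binary or newline="" results, because the default text-mode string no longer contains any carriage returns.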

--irmen
Nov 12 '08 #4

Gilles Ganault wrote:
>Hello
>
>I'm stuck at understanding why Python can't extract some bit from an
>HTML file using regexes, although I can find it just fine with
>UltraEdit.
>
>#BAD
>friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
>| re.MULTILINE | re.DOTALL)

If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....
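
As a rough illustration of the \s* suggestion, the sketch below (Python 3, with made-up sample strings rather than the real content.html) compares the original strict pattern with the tolerant one. The re.MULTILINE and re.DOTALL flags are left out because neither pattern uses ^, $ or ".", so they would have no effect.

```python
import re

# Strict pattern from the original post: requires literal CR LF pairs.
strict = re.compile("</td></tr></table>\r\n</div>\r\n", re.IGNORECASE)

# Tolerant pattern: \s* accepts any run of whitespace, including none.
# A raw string keeps the backslash intact for the regex engine.
tolerant = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# Hypothetical inputs covering the line-ending variants discussed above.
samples = {
    "CRLF": "</td></tr></table>\r\n</div>\r\n",
    "LF": "</td></tr></table>\n</div>\n",
    "none": "</td></tr></table></div>",
}

strict_results = {name: bool(strict.search(text)) for name, text in samples.items()}
tolerant_results = {name: bool(tolerant.search(text)) for name, text in samples.items()}

print("strict:  ", strict_results)
print("tolerant:", tolerant_results)
```

The strict pattern matches only the CRLF sample, while the tolerant one matches all three, which makes it insensitive to whether the newlines were translated on the way in.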

--irmen
Nov 12 '08 #5
