
Does Python mess with CRLFs?

Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
import re

f = open("content.html", "r")
content = f.read()
f.close()

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

#GOOD
friends = re.compile('</td></tr></table>',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

m = friends.search(content)
if m:
    print "Found"
else:
    print "List not found"
==========

Thank you for any tip.
Nov 12 '08 #1
4 Replies


On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
wrote:
>I wonder if Python rewrites CRLFs when reading a text file with
>open/read?

For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.
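
A quick way to check this kind of claim is to write a file with known CRLF line endings and read it back in both modes. The following is only a minimal sketch in Python 3 (the thread itself uses Python 2); the helper name `read_both_modes` and the temporary file are illustrative assumptions. Note that in Python 3 the text-mode translation happens on every platform, not only on Windows, while "rb" returns the bytes untouched.

```python
import os
import tempfile


def read_both_modes(raw):
    """Write raw bytes to a temporary file, then read it back in text and binary mode."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Text mode: universal newlines turn "\r\n" into "\n" on read.
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
        # Binary mode: no newline translation at all.
        with open(path, "rb") as f:
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes


text, data = read_both_modes(b"</table>\r\n</div>\r\n")
print(repr(text))
print(repr(data))
```

If the two results differ only in their line endings, a pattern containing a literal \r\n can match the binary data but not the text-mode data.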
Nov 12 '08 #2

On Nov 12, 10:04 pm, Gilles Ganault <nos...@nospam.com> wrote:
>Hello
>
>I'm stuck at understanding why Python can't extract some bit from an
>HTML file using regexes, although I can find it just fine with
>UltraEdit.
>
>I wonder if Python rewrites CRLFs when reading a text file with
>open/read?

Don't wonder; do some very elementary debugging and find out for
yourself.

>Here's the code:
>==========
>f = open("content.html", "r")
>content = f.read()
>f.close()

Consider inserting
print repr(content)
here.
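
To illustrate why repr is useful for this: a plain print usually renders line endings invisibly, while repr spells them out as escape sequences. A small sketch in Python 3 syntax, using made-up sample strings rather than the real file:

```python
# Two sample strings that usually look the same when printed
# but differ in their line endings.
with_crlf = "</table>\r\n</div>\r\n"
with_lf = "</table>\n</div>\n"

# repr() shows the escape sequences, so the difference becomes visible.
print(repr(with_crlf))  # '</table>\r\n</div>\r\n'
print(repr(with_lf))    # '</table>\n</div>\n'

# Counting the line endings is another quick check.
crlf_count = with_crlf.count("\r\n")
lf_only_count = with_lf.count("\n") - with_lf.count("\r\n")
print(crlf_count, lf_only_count)
```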

Nov 12 '08 #3


Gilles Ganault wrote:
>On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
>wrote:
>>I wonder if Python rewrites CRLFs when reading a text file with
>>open/read?
>
>For those seeing the same thing, the answer is yes: On Windows, the
>code above turns CRLF into LF. I tried "rb" instead of "r", with no
>difference.

Sorry, but that is not what's happening. Your problem is not in reading the
file; it's in the regular expression you're using.

Using open with the "rb" flag leaves the file content intact and does not munge newlines
in any way. A read() will return the exact bytes that are in the file.

--irmen
Nov 12 '08 #4


Gilles Ganault wrote:
>Hello
>
>I'm stuck at understanding why Python can't extract some bit from an
>HTML file using regexes, although I can find it just fine with
>UltraEdit.
>
>#BAD
>friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
>| re.MULTILINE | re.DOTALL)

If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....
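
As a rough illustration of the difference, the sketch below compares the pattern from the original post with the \s* variant. It is written in Python 3 syntax, and the two sample strings are hypothetical stand-ins for content.html, whose real contents are unknown:

```python
import re

# Pattern from the original post: requires literal CRLF line endings.
strict = re.compile("</td></tr></table>\r\n</div>\r\n", re.IGNORECASE)
# Suggested alternative: \s* accepts any run of whitespace, including none.
tolerant = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# Hypothetical sample content with the two line-ending styles.
crlf_content = "<table><tr><td>x</td></tr></table>\r\n</div>\r\n"
lf_content = "<table><tr><td>x</td></tr></table>\n</div>\n"

for name, content in [("CRLF", crlf_content), ("LF", lf_content)]:
    print(name, bool(strict.search(content)), bool(tolerant.search(content)))
```

The strict pattern only matches when the CRLF pairs survive the read, while the \s* pattern matches either way, which makes it the safer choice when the line endings are in doubt.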

--irmen
Nov 12 '08 #5
