Does Python mess with CRLFs?

Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
import re

f = open("content.html", "r")
content = f.read()
f.close()

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

#GOOD
friends = re.compile('</td></tr></table>',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

m = friends.search(content)
if m:
    print("Found")
else:
    print("List not found")
==========

Thank you for any tip.
Nov 12 '08 #1
On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
wrote:
> I wonder if Python rewrites CRLFs when reading a text file with
> open/read?

For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.
Nov 12 '08 #2
On Nov 12, 10:04 pm, Gilles Ganault <nos...@nospam.com> wrote:
> Hello
>
> I'm stuck at understanding why Python can't extract some bit from an
> HTML file using regexes, although I can find it just fine with
> UltraEdit.
>
> I wonder if Python rewrites CRLFs when reading a text file with
> open/read?

Don't wonder; do some very elementary debugging and find out for
yourself.

> Here's the code:
> ==========
> f = open("content.html", "r")
> content = f.read()
> f.close()

Consider inserting
print(repr(content))
here.
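
As a minimal sketch of why repr() helps here: the two strings below are made-up stand-ins for the contents of content.html (not the poster's actual file), and has_crlf is a hypothetical helper added only for illustration. A plain print shows both strings the same way on screen, while repr() spells out the escape sequences.

```python
# Hypothetical sample text standing in for the contents of content.html.
crlf_text = "</table>\r\n</div>\r\n"
lf_text = "</table>\n</div>\n"

# repr() shows \r and \n as visible escape sequences, so the two
# line-ending conventions can be told apart at a glance.
print(repr(crlf_text))
print(repr(lf_text))


def has_crlf(text):
    """Return True if the text contains at least one CRLF pair."""
    return "\r\n" in text


print(has_crlf(crlf_text))
print(has_crlf(lf_text))
```

If the repr() of the real file content shows only \n, a pattern that contains a literal \r\n cannot match it.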

Nov 12 '08 #3

Gilles Ganault wrote:
> On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <no****@nospam.com>
> wrote:
>> I wonder if Python rewrites CRLFs when reading a text file with
>> open/read?
>
> For those seeing the same thing, the answer is yes: On Windows, the
> code above turns CRLF into LF. I tried "rb" instead of "r", with no
> difference.

Sorry but that is not what's happening. Your problem is not in reading the
file, it's in the regular expression you're using.

Using open with the "rb" flag leaves the file content intact and does not munge newlines
in any way. A read() will return the exact bytes that are in the file.
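
The difference between the modes can be checked with a small experiment. This is only a sketch, and it assumes Python 3, where text mode applies universal-newline translation on every platform unless told otherwise; in the Python 2 of this thread's era, plain "r" translated CRLF only on Windows. The temporary file and its contents are made up for the test.

```python
import os
import tempfile

# Hypothetical sample: two lines terminated by CRLF, written as raw bytes.
raw = b"</table>\r\n</div>\r\n"

fd, path = tempfile.mkstemp(suffix=".html")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        binary_content = f.read()

    # Text mode with newline=None (the default): CRLF is translated to LF.
    with open(path, "r", newline=None, encoding="ascii") as f:
        translated = f.read()

    # Text mode with newline="": line endings are left untranslated.
    with open(path, "r", newline="", encoding="ascii") as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(repr(binary_content))
print(repr(translated))
print(repr(untranslated))
```

So if the content was read in binary mode and a literal \r\n pattern still fails, the file most likely does not contain CRLF at that spot in the first place.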

--irmen
Nov 12 '08 #4

Gilles Ganault wrote:
> Hello
>
> I'm stuck at understanding why Python can't extract some bit from an
> HTML file using regexes, although I can find it just fine with
> UltraEdit.
>
> #BAD
> friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
> | re.MULTILINE | re.DOTALL)

If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of a literal \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....
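
For illustration, here is a sketch of the \s* approach run against two made-up snippets (the real content.html is unknown, so crlf_html and lf_html are assumptions). The same pattern matches whether the line breaks are CRLF or LF, while the original literal \r\n pattern only matches the CRLF version.

```python
import re

# \s* absorbs any run of whitespace, including \r\n and a bare \n.
friends = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# The literal pattern from the original post, for comparison.
literal = re.compile("</td></tr></table>\r\n</div>\r\n", re.IGNORECASE)

# Hypothetical samples; the actual file contents are not known.
crlf_html = "<table><tr><td>x</td></tr></table>\r\n</div>\r\n"
lf_html = "<table><tr><td>x</td></tr></table>\n</div>\n"

for label, html in (("CRLF", crlf_html), ("LF", lf_html)):
    tolerant = "Found" if friends.search(html) else "List not found"
    strict = "Found" if literal.search(html) else "List not found"
    print(label, "| \\s* pattern:", tolerant, "| literal pattern:", strict)
```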

--irmen
Nov 12 '08 #5
