webspider, regexp not working, why?

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?")

Why isn't this url pattern catching something like:

<link rel="alternate" type="application/rss+xml" title="Python Screencasts" href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />

site = urllib.urlopen("http://www.python.org")
for row in site:
    obj = url.search(row)
    if obj is not None:
        print "url: ", obj.group()

I know it works because it can catch
www.hello.com in a txt-file, and I can catch emails on websites with
another regexp.

search and match yield the same results.

But when you put something like href= in front of it, it doesn't work.

I see now that it has to match at the beginning of the row or something,
because:
hi www.google.com
doesn't match but
www.google.com hi
matches.
I thought a regexp would search a row/file and report an occurrence
wherever it finds one, so a regexp of "lo" would match in lopez.
Jun 27 '08 #1

-----Original Message-----
From: py********************************@python.org [mailto:python-
li*************************@python.org] On Behalf Of
no**********@yahoo.se
Sent: Friday, May 23, 2008 12:43 PM
To: py*********@python.org
Subject: webspider, regexp not working, why?

> url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]
>
> search and match yields the same results.
>
> but when you put something like href= in front of it it doesnt work.

a) '^' matches only at the beginning of the string (or of each line,
with re.MULTILINE). So if 'href=' is at the beginning of the line, the
URL is not, and the pattern cannot match.

b) Regexes are hard enough to read as is. (http|ftp|https) is more
readable than (ht|f)tp(s?).

c) If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml. Many other
regex gurus have learned this lesson. Myself included. =)
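
To illustrate point (a), here is a minimal sketch written for Python 3. The pattern HOST is a deliberately simplified stand-in, not the poster's full regex, and the helper name first_url is made up for this example. It only shows how a leading '^' changes what search() can find.

```python
import re

# Simplified, hypothetical pattern (NOT the full regex from the post):
# an optional scheme followed by dot-separated host labels.
HOST = r"(?:(?:https?|ftp)://)?(?:[\w-]+\.)+[a-zA-Z]{2,5}"

anchored = re.compile("^" + HOST)  # like the original: must start at position 0
unanchored = re.compile(HOST)      # may start anywhere in the string


def first_url(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group() if m else None


print(first_url(anchored, "www.google.com hi"))    # www.google.com
print(first_url(anchored, "hi www.google.com"))    # None
print(first_url(unanchored, "hi www.google.com"))  # www.google.com

# match() always starts at position 0, so it behaves like the anchored
# search even when the pattern itself has no '^'.
print(unanchored.match("hi www.google.com"))       # None
```

With the '^' in place, search() and match() can only succeed at the start of the string, which is why both gave the same results in the original post. Dropping the '^' lets search() find a URL that comes after 'href='.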

Jun 27 '08 #2
On May 24, 3:26 am, "Reedick, Andrew" <jr9...@ATT.COM> wrote:
> c) If you're going to parse html/xml then bite the bullet and learn one
> of the libraries specifically designed to parse html/xml. Many other
> regex gurus have learned this lesson. Myself included. =)

Agreed. The BeautifulSoup approach is particularly nice (although not
part of stdlib):

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> html = urllib.urlopen('http://www.python.org/').read()
>>> soup = BeautifulSoup(html)
>>> links = [link['href'] for link in soup('link')]
>>> links[0]
u'http://www.python.org/channews.rdf'

- alex23
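
For anyone who would rather stay inside the standard library, the same idea can probably be sketched with html.parser in Python 3. This is not from the thread: the class name LinkHrefCollector and the helper extract_hrefs are made up for illustration, and the sample markup is the <link> tag from the first post joined onto one line.

```python
from html.parser import HTMLParser


class LinkHrefCollector(HTMLParser):
    """Collect the href attribute of every <link> and <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased. Self-closing tags such as <link ... />
        # are routed here by the default handle_startendtag().
        if tag in ("link", "a"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Hypothetical helper: return all collected href values, in order."""
    parser = LinkHrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<link rel="alternate" type="application/rss+xml" '
    'title="Python Screencasts" '
    'href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />'
    '<p>see <a href="/about/">about</a></p>'
)
print(extract_hrefs(sample))
```

Because the parser reads attributes rather than raw text, it does not matter where on the line the href appears or what precedes it.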

Jun 27 '08 #3
