webspider, regexp not working, why?

notnorwegian

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?")

Why isn't this regexp catching something like:

<link rel="alternate" type="application/rss+xml" title="Python Screencasts" href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />

import urllib

site = urllib.urlopen("http://www.python.org")
for row in site:
    obj = url.search(row)  # url is the compiled pattern above
    if obj is not None:
        print "url: ", obj.group()

I know it works because it can catch
www.hello.com in a text file, and I can catch e-mail addresses on websites with
another regexp.

search and match yield the same results.

But when you put something like href= in front of it, it doesn't work.

I see now that it has to match at the beginning of the row or something,
because:
hi www.google.com
doesn't match, but
www.google.com hi
matches.
I thought a regexp would search a row/file and report an occurrence
wherever it finds one, so a regexp of "lo" would match in lopez.

Jun 27 '08 #1


Reedick, Andrew

-----Original Message-----
From: py********************************@python.org [mailto:python-
li*************************@python.org] On Behalf Of
no**********@yahoo.se
Sent: Friday, May 23, 2008 12:43 PM
To: py*********@python.org
Subject: webspider, regexp not working, why?

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]

search and match yields the same results.

but when you put something like href= in front of it it doesnt work.

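a) '^' matches only at the beginning of a line. So unless the URL itself
starts the line, the pattern fails; with 'href=' in front of it, nothing
matches.

b) Regexes are hard enough to read as is. (http|ftp|https) is more
readable than (ht|f)tp(s?), although the latter also accepts ftps.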

c) If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml. Many other
regex gurus have learned this lesson. Myself included. =)
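
Point (a) can be illustrated with a minimal sketch in Python 3. The HOST pattern below is a simplified stand-in made up for this example, not the original regex, and first_match is a hypothetical helper. It shows that a pattern starting with '^' only matches when the host name begins the line, which is also why search and match give the same results for the anchored version.

```python
import re

# Simplified stand-in pattern (an assumption, not the regex from the post):
# a host name such as www.google.com.
HOST = r"[a-zA-Z][\w\-]*(\.[\w\-]+)+"

anchored = re.compile("^" + HOST)  # the match must start at position 0
unanchored = re.compile(HOST)      # the match may start anywhere in the line

lines = [
    "www.google.com hi",
    "hi www.google.com",
    'href="http://www.showmedo.com/x"',
]


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None


for line in lines:
    # Print what each variant finds in the same line.
    print(repr(line), first_match(anchored, line), first_match(unanchored, line))
```

Dropping the leading '^' is enough to make search find a host name in the middle of a line; whether the full original pattern then matches the whole URL is a separate question that this sketch does not address.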


Jun 27 '08 #2

alex23

On May 24, 3:26 am, "Reedick, Andrew" <jr9...@ATT.COM> wrote:

c) If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml. Many other
regex gurus have learned this lesson. Myself included. =)

Agreed. The BeautifulSoup approach is particularly nice (although not
part of stdlib):

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> html = urllib.urlopen('http://www.python.org/').read()
>>> soup = BeautifulSoup(html)
>>> links = [link['href'] for link in soup('link')]
>>> links[0]
u'http://www.python.org/channews.rdf'
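
If installing a third-party package is not an option, a similar result can probably be had from the standard library alone. The following is a minimal sketch for Python 3 using html.parser; the HrefCollector class name is made up for this example, and the sample string is the link tag from the original question rather than a live download.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# The tag from the original question, joined back onto one line.
sample = (
    '<link rel="alternate" type="application/rss+xml" '
    'title="Python Screencasts" '
    'href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />'
)

collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

The parser hands over attribute values already split out, so no URL regexp is needed to find the link; a regexp could still be applied to the collected values afterwards to filter them.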

- alex23

Jun 27 '08 #3
