Getting URLs

How do I get all the URLs in a page?

May 19 '06 #1
Use htmlparser or a regular expression.
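
The regular-expression route could look roughly like the sketch below. It is written for Python 3, and the names `HREF_RE` and `find_hrefs` are made up for illustration. It only catches quoted `href` values, so unquoted attributes and links built by scripts are missed.

```python
import re

# Matches href="..." or href='...'; a rough heuristic, not a real HTML parser.
# Assumption: the attribute value is quoted. Unquoted values are not matched.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)


def find_hrefs(html_text):
    """Return every quoted href value found in html_text, in document order."""
    return [match.group(2) for match in HREF_RE.finditer(html_text)]


sample = '<a href="/about.html">About</a> <A HREF=\'http://example.com/\'>Ex</A>'
print(find_hrefs(sample))
```

A pattern like this is fine for a quick scrape, but it will also match `href` text inside comments or scripts, which is one reason a parser is usually the safer choice.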

May 19 '06 #2
Thanks

May 19 '06 #3
"defcon8" <de*****@gmail.com> wrote in message
news:11**********************@y43g2000cwc.googlegroups.com...
How do I get all the URLs in a page?


pyparsing comes with a simple example that does this, too.

-- Paul
Download pyparsing at http://sourceforge.net/projects/pyparsing
May 19 '06 #4
It is difficult to get all the URLs in a page.
You can use the sgmllib module to parse HTML files
and get the standard href attributes.
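
A parser-based version might look like the sketch below. Note the swap: sgmllib was removed in Python 3, so this uses the standard library's `html.parser` instead, which works along similar lines. The class name `LinkCollector` and the helper `collect_links` are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Parse html_text and return the hrefs of its anchor tags, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<p><a href="/services/">Business</a> <A HREF="/intl/en/about.html">About</A></p>'
print(collect_links(sample))
```

This only picks up the "standard" links mentioned above, that is, `href` attributes on anchor tags. URLs in `src` attributes, in plain text, or generated by JavaScript would need extra handling.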

May 19 '06 #5
"softwindow" <so********@gmail.com> wrote in message
news:11*********************@j73g2000cwa.googlegroups.com...
it is difficult to get all URLs in a page

<snip>

Is this really so hard?:

-----------------
from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
    Word,dblQuotedString,alphanums,SkipTo,makeHTMLTags
import urllib

# extract all <a> anchor tags - makeHTMLTags defines a
# fairly robust pair of match patterns, not just "<tag>","</tag>"
linkOpenTag,linkCloseTag = makeHTMLTags("a")
link = linkOpenTag + \
    SkipTo(linkCloseTag).setResultsName("body") + \
    linkCloseTag.suppress()

# read the HTML source from some random URL
serverListPage = urllib.urlopen( "http://www.google.com" )
htmlText = serverListPage.read()
serverListPage.close()

# use the link grammar to scan the HTML source
# (Python 2 syntax: print statement and urllib.urlopen)
for toks,strt,end in link.scanString(htmlText):
    print toks.startA.href,"->",toks.body

-----------------
Prints:
/url?sa=p&pref=ig&pval=2&q=http://www.google.com/ig%3Fhl%3Den -> Personalized Home
https://www.google.com/accounts/Logi...gle.com/&hl=en -> Sign in
/imghp?hl=en&tab=wi&ie=UTF-8 -> Images
http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8 -> Groups
http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8 -> News
http://froogle.google.com/frghp?hl=en&tab=wf&ie=UTF-8 -> Froogle
/maphp?hl=en&tab=wl&ie=UTF-8 -> Maps
/intl/en/options/ -> more&nbsp;&raquo;
/advanced_search?hl=en -> Advanced Search
/preferences?hl=en -> Preferences
/language_tools?hl=en -> Language Tools
/intl/en/ads/ -> Advertising&nbsp;Programs
/services/ -> Business Solutions
/intl/en/about.html -> About Google
-- Paul
May 19 '06 #6
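
Several of the hrefs in the output above are relative (for example `/preferences?hl=en`), so they are not usable as URLs on their own. One way to turn them into absolute URLs, sketched here for Python 3 with the standard `urllib.parse.urljoin`, is to join each one against the address of the page it came from. The helper name `absolutize` is made up for the example, and the base URL is assumed to be the one fetched in the script.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# A few hrefs taken from the output above.
hrefs = [
    "/preferences?hl=en",
    "/intl/en/about.html",
    "http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8",
]

for url in absolutize("http://www.google.com", hrefs):
    print(url)
```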
