cut strings and parse for images

Hi,

I used SGMLParser to parse all the hrefs in an HTML file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

I would like to cut the string so that only the domain and directory are
left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash, but not in Python. How could
this be done?

The next problem is to extract not only hrefs, but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url

Perhaps you have some tips on how to solve these problems?

regards
Andreas
Jul 18 '05 #1
Check out the urlparse module (in std distribution). For images, you can
provide a default addressing scheme, so you can expand "images/marine.jpg"
relative to the current location.

-- Paul
Jul 18 '05 #2
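
For the first question (keeping only the domain and directory), here is a minimal sketch along the lines Paul suggests. It assumes Python 3, where the old urlparse module is available as urllib.parse; the helper name directory_url is made up for illustration and is not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_url(url):
    """Return url with the last path segment, query and fragment dropped."""
    parts = urlsplit(url)
    # Everything up to and including the last "/" of the path is the directory.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


result = directory_url("http://www.example.com/dir/example.html")
print(result)
```

With the URL from the question this should give back http://www.example.com/dir/. A URL with no path at all comes back with a bare trailing slash.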
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?gr...561&amp;type=1

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
Jul 18 '05 #3
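
Steve's parser relies on sgmllib, which no longer exists in Python 3. A rough equivalent, assuming Python 3 and swapping in html.parser.HTMLParser from the standard library, might look like the sketch below. The class name LinkAndImageLister and the sample_html string are made up for illustration, and the sample is fed from a string instead of being downloaded.

```python
from html.parser import HTMLParser


class LinkAndImageLister(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.urls.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Hypothetical input built from the two tags quoted in the question.
sample_html = (
    '<a href="install.php">Install</a>'
    '<img class="bild" src="images/marine.jpg">'
)

parser = LinkAndImageLister()
parser.feed(sample_html)
parser.close()
print("URLs:", parser.urls)
print("IMGs:", parser.images)
```

Unlike the do_img version, this one simply skips an image that has no src attribute.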
On Mon, 06 Dec 2004 20:36:36 GMT, Paul McGuire wrote:
Check out the urlparse module (in std distribution). For images, you
can provide a default addressing scheme, so you can expand
"images/marine.jpg" relative to the current location.


OK, this looks good. But I'm a real newbie to Python and not able to
create a minimal example. Could you please give me a very small example of
how to use urlparse? Or point me to an example on the web?

regards
Andreas
Jul 18 '05 #4
No problem. Googling for 'python urlparse' gets us immediately to:
http://www.python.org/doc/current/li...-urlparse.html. This online doc
has some examples built into it.

But as a newbie, it would also be good to get comfortable with dir() and
help() and trying simple commands at the >>> Python prompt. If I type the
following at the Python prompt:

>>> import urlparse
>>> help(urlparse)

I get almost the same output straight from the Python source.

dir(urlparse) gives me just a list of the global symbol names from the
module, but sometimes that's enough of a clue without reading the whole doc.

Now is where the intrepid Pythonista-to-be uses the Python interactive
prompt and the tried-and-true Python methodology known as "Just Trying Stuff
Out".

>>> url = "http://www.example.com/dir/example.html" # from your example
>>> urlparse(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable

(Damn! forgot to prefix with urlparse.)

>>> urlparse.urlparse(url)
('http', 'www.example.com', '/dir/example.html', '', '', '')
>>> img = "images/marine.jpeg" # also from your example
>>> urlparse.urlparse(img)
('', '', 'images/marine.jpeg', '', '', '')

Now you can start to predict what kind of tuples you'll get back from
urlparse, and you can visualize how you might merge the data from the img
fragment and the url fragment. Wait, I didn't read all of the doc - let's
try urljoin!

>>> urljoin(url,img)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'urljoin' is not defined

(Damn! forgot to prefix with urlparse AGAIN!)

>>> urlparse.urljoin(url,img)
'http://www.example.com/dir/images/marine.jpeg'

Is this in the ballpark of where you are trying to go?

-- Paul
Give a man a fish and you feed him for a day; give a man a fish every day
and you feed him for the rest of his life.
Jul 18 '05 #5
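
Paul's session uses the Python 2 urlparse module. As a hedged sketch of the same idea on Python 3 (assuming urllib.parse, where urlparse and urljoin now live), the snippet below resolves a handful of extracted links against the page they came from. The list named found is made-up sample data, not real parser output.

```python
from urllib.parse import urljoin, urlparse

page_url = "http://www.example.com/dir/example.html"

# Hypothetical mix of relative and absolute links, as a parser might collect them.
found = ["install.php", "images/marine.jpeg", "http://www.stratagus.org/"]

# urlparse returns a named tuple, so the pieces can be read by name.
parts = urlparse(page_url)
print(parts.scheme, parts.netloc, parts.path)

# urljoin resolves relative links against page_url and leaves absolute ones alone.
absolute = [urljoin(page_url, link) for link in found]
for link in absolute:
    print(link)
```

Relative entries pick up the directory of page_url, while the entry that is already absolute passes through unchanged.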
On Tue, 07 Dec 2004 00:40:02 GMT, Paul McGuire wrote:
Is this in the ballpark of where you are trying to go?


Yes, thanks. You helped me a lot.

Andreas
Jul 18 '05 #6
