cut strings and parse for images

Hi,

I used SGMLParser to parse all the hrefs in an HTML file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I would like to cut the string so that only the domain and directory are
left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash, but not in Python. How could
this be done?

The next problem is to extract not only hrefs but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url

Perhaps you have some tips on how to solve these problems?

regards
Andreas
Jul 18 '05 #1
Check out the urlparse module (in std distribution). For images, you can
provide a default addressing scheme, so you can expand "images/marine.jpg"
relative to the current location.

-- Paul
Jul 18 '05 #2
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?gr...561&amp;type=1
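
sgmllib was removed in Python 3, so the listing above only runs on Python 2. A minimal sketch of the same idea for Python 3, assuming the standard html.parser module is an acceptable substitute; the class name LinkImageLister and the sample HTML string are made up for illustration:

```python
from html.parser import HTMLParser


class LinkImageLister(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        for name, value in attrs:
            if value is None:
                continue  # attribute given without a value
            if tag == "a" and name == "href":
                self.urls.append(value)
            elif tag == "img" and name == "src":
                self.images.append(value)


# Hypothetical sample page, built from the two tags quoted in the question.
SAMPLE_HTML = (
    '<a href="install.php">Install</a>'
    '<img class="bild" src="images/marine.jpg">'
)

parser = LinkImageLister()
parser.feed(SAMPLE_HTML)
parser.close()

print("URLs:", parser.urls)
print("IMGs:", parser.images)
```

To run it against a live page instead of the sample string, the page text would have to be fetched and decoded first, for example with urllib.request.urlopen.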

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
Jul 18 '05 #3
On Mon, 06 Dec 2004 20:36:36 GMT, Paul McGuire wrote:
Check out the urlparse module (in std distribution). For images, you
can provide a default addressing scheme, so you can expand
"images/marine.jpg" relative to the current location.


OK, this looks good. But I'm a real newbie to Python and not able to
create a minimal example. Could you please give me a very small example
of how to use urlparse, or point me to an example on the web?

regards
Andreas
Jul 18 '05 #4
No problem. Googling for 'python urlparse' gets us immediately to:
http://www.python.org/doc/current/li...-urlparse.html. This online doc
has some examples built into it.

But as a newbie, it would also be good to get comfortable with dir() and
help() and trying simple commands at the >>> Python prompt. If I type the
following at the Python prompt:
>>> import urlparse
>>> help(urlparse)

I get almost the same output straight from the Python source.

dir(urlparse) gives me just a list of the global symbol names from the
module, but sometimes that's enough of a clue without reading the whole doc.
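
The same exploration works on Python 3, where the urlparse module was folded into urllib.parse. A small sketch, assuming Python 3; the variable name url_names is only for illustration:

```python
import urllib.parse

# dir() returns every name defined in the module; keep only the public
# ones that mention "url" to get a quick overview of what is available.
url_names = [
    name
    for name in dir(urllib.parse)
    if "url" in name and not name.startswith("_")
]
print(url_names)
```

Calling help(urllib.parse.urljoin) at the prompt then shows the documentation for a single function rather than the whole module.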

Now is where the intrepid Pythonista-to-be uses the Python interactive
prompt and the tried-and-true Python methodology known as "Just Trying Stuff
Out".
>>> url = "http://www.example.com/dir/example.html" # from your example
>>> urlparse(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable

(Damn! forgot to prefix with urlparse.)

>>> urlparse.urlparse(url)
('http', 'www.example.com', '/dir/example.html', '', '', '')
>>> img = "images/marine.jpeg" # also from your example
>>> urlparse.urlparse(img)
('', '', 'images/marine.jpeg', '', '', '')

Now you can start to predict what kind of tuples you'll get back from
urlparse, and you can visualize how you might merge the data from the img
fragment and the url fragment. Wait, I didn't read all of the doc - let's
try urljoin!

>>> urljoin(url,img)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'urljoin' is not defined

(Damn! forgot to prefix with urlparse AGAIN!)

>>> urlparse.urljoin(url,img)
'http://www.example.com/dir/images/marine.jpeg'

Is this in the ballpark of where you are trying to go?
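
For readers on Python 3, a minimal sketch of the same session, assuming the functions are imported from urllib.parse (the Python 3 home of urlparse). It also touches the first question in the thread, trimming a URL down to its directory, by joining against "."; that trick is one possible approach rather than something suggested above, and the variable names are illustrative:

```python
from urllib.parse import urljoin, urlparse

url = "http://www.example.com/dir/example.html"
img = "images/marine.jpg"

# urlparse returns a named tuple, so the fields can be read by name.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Joining against "." drops the last path segment and keeps the directory.
base_dir = urljoin(url, ".")
print(base_dir)

# A relative image path is resolved against the page it was found on.
img_url = urljoin(url, img)
print(img_url)
```

The same urljoin call can be applied to each entry of parser.urls and parser.images, using the address of the fetched page as the base.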

-- Paul
Give a man a fish and you feed him for a day; give a man a fish every day
and you feed him for the rest of his life.
Jul 18 '05 #5
On Tue, 07 Dec 2004 00:40:02 GMT, Paul McGuire wrote:
Is this in the ballpark of where you are trying to go?


Yes, thanks. You helped me a lot.

Andreas
Jul 18 '05 #6
