cut strings and parse for images

Hi,

I used SGMLParser to parse all hrefs in an HTML file. Now I need to cut
some strings. For example:

http://www.example.com/dir/example.html

Now I would like to cut the string so that only the domain and directory
are left over. Expected result:

http://www.example.com/dir/

I know how to do this in bash, but not in Python. How could
this be done?

The next problem is to extract not only hrefs, but also images. An href
is easy:

<a href="install.php">Install</a>

But an image is a little harder:

<img class="bild" src="images/marine.jpg">

This is my current example code:

from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url

Perhaps you have some tips on how to solve these problems?

regards
Andreas
Jul 18 '05 #1
Check out the urlparse module (in std distribution). For images, you can
provide a default addressing scheme, so you can expand "images/marine.jpg"
relative to the current location.
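
A minimal sketch of how the first question (keeping only the domain and directory) might be handled. It assumes Python 3, where the urlparse module mentioned above is available as urllib.parse. The helper name base_directory is hypothetical and not part of the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def base_directory(url):
    """Return url with everything after the last '/' of its path removed."""
    parts = urlsplit(url)
    # Drop the final path segment (the file name) and keep the trailing slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # Query string and fragment are discarded on purpose.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


print(base_directory("http://www.example.com/dir/example.html"))
```

Splitting the URL first means only the path component is trimmed, so the "//" after the scheme is never mistaken for a directory separator.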

-- Paul
Jul 18 '05 #2
from sgmllib import SGMLParser

leach_url = "http://stargus.sourceforge.net/"

class URLLister(SGMLParser):

    def reset(self):
        SGMLParser.reset(self)
        self.urls = []
        self.images = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

    def do_img(self, attrs):
        "We assume each image *has* a src attribute."
        for k, v in attrs:
            if k == 'src':
                self.images.append(v)
                break

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen(leach_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    print "URLs:"
    for url in parser.urls:
        print url
    print "IMGs:"
    for img in parser.images:
        print img

$ python sgml1.py
URLs:
about.php
install.php
features.php
http://www.stratagus.org/
http://www.stratagus.org/
http://www.blizzard.com/
http://sourceforge.net/projects/stargus/
IMGs:
images/stargus_banner.jpg
images/marine.jpg
http://sourceforge.net/sflogo.php?gr...561&amp;type=1

regards
Steve
--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119
Jul 18 '05 #3
On Mon, 06 Dec 2004 20:36:36 GMT, Paul McGuire wrote:
> Check out the urlparse module (in std distribution). For images, you
> can provide a default addressing scheme, so you can expand
> "images/marine.jpg" relative to the current location.

OK, this looks good. But I'm a real newbie to Python and not able to
create a minimal example. Could you please give me a very small example
of how to use urlparse? Or point me to an example on the web?

regards
Andreas
Jul 18 '05 #4

No problem. Googling for 'python urlparse' gets us immediately to:
http://www.python.org/doc/current/li...-urlparse.html. This online doc
has some examples built into it.

But as a newbie, it would also be good to get comfortable with dir() and
help() and trying simple commands at the >>> Python prompt. If I type the
following at the Python prompt:

>>> import urlparse
>>> help(urlparse)

I get almost the same output straight from the Python source.

dir(urlparse) gives me just a list of the global symbol names from the
module, but sometimes that's enough of a clue without reading the whole doc.
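
As a small illustration of the dir() tip in script form, here is a sketch that assumes Python 3, where the module is named urllib.parse instead of urlparse. The variable name url_helpers is made up for this example.

```python
import inspect
import urllib.parse

# dir() lists every name in the module; keep only those starting with "url".
url_helpers = sorted(name for name in dir(urllib.parse) if name.startswith("url"))
print(url_helpers)

# help() displays the docstring; inspect.getdoc() returns the same text as a string.
doc = inspect.getdoc(urllib.parse.urljoin) or ""
print(doc.splitlines()[0] if doc else "(no docstring available)")
```

Filtering the dir() output like this is often enough to spot a likely candidate such as urljoin before reading the full documentation.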

Now is where the intrepid Pythonista-to-be uses the Python interactive
prompt and the tried-and-true Python methodology known as "Just Trying Stuff
Out".

>>> url = "http://www.example.com/dir/example.html" # from your example
>>> urlparse(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: 'module' object is not callable

(Damn! forgot to prefix with urlparse.)

>>> urlparse.urlparse(url)
('http', 'www.example.com', '/dir/example.html', '', '', '')
>>> img = "images/marine.jpeg" # also from your example
>>> urlparse.urlparse(img)
('', '', 'images/marine.jpeg', '', '', '')

Now you can start to predict what kind of tuples you'll get back from
urlparse, you can visualize how you might merge the data from the img
fragment and the url fragment. Wait, I didn't read all of the doc - let's
try urljoin!

>>> urljoin(url,img)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'urljoin' is not defined

(Damn! forgot to prefix with urlparse AGAIN!)

>>> urlparse.urljoin(url,img)
'http://www.example.com/dir/images/marine.jpeg'
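
For readers on Python 3, the session above can be sketched as a short script. It assumes the urlparse module has become urllib.parse. The links list is a hypothetical sample built from values that appear elsewhere in this thread, resolved against the stargus page URL from the original post.

```python
from urllib.parse import urljoin, urlparse

url = "http://www.example.com/dir/example.html"  # from the original question
img = "images/marine.jpeg"

# urlparse splits a URL into scheme, network location, path and so on.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# urljoin resolves a relative reference against the page it was found on.
absolute = urljoin(url, img)
print(absolute)

# Links that are already absolute come back unchanged, so a mixed list
# of relative and absolute links can be resolved in one pass.
page_url = "http://stargus.sourceforge.net/"
links = ["about.php", "images/marine.jpg", "http://www.stratagus.org/"]
resolved = [urljoin(page_url, link) for link in links]
print(resolved)
```

Importing the two functions by name avoids the missing-prefix errors shown in the session.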

Is this in the ballpark of where you are trying to go?

-- Paul
Give a man a fish and you feed him for a day; give a man a fish every day
and you feed him for the rest of his life.
Jul 18 '05 #5
On Tue, 07 Dec 2004 00:40:02 GMT, Paul McGuire wrote:
> Is this in the ballpark of where you are trying to go?

Yes, thanks. You helped me a lot.

Andreas
Jul 18 '05 #6
