Bytes | Software Development & Data Engineering Community

Help running BeautifulSoup script

Hi,
I'm very new to Python. Basically, all I want to do is run this script, but I don't know how.

My OS is MS Windows XP.
Any help is much appreciated.

P.S. I have already installed BeautifulSoup.
Aug 22 '07 #1
bartonc
6,596 Expert 4TB
You'll need to add the Python directory to your PATH environment variable. After that, I'd say decompress the backup_fotopic.tar.gz file into a folder. Start a DOS shell, cd to that folder, and type

backup_fotopic.py

Assuming that Python is installed correctly (.py files have the Python icon), the script will run.
Aug 23 '07 #2
You'll need to add the Python directory to your PATH environment variable.
How do I do this?

Thanks
Aug 23 '07 #3
bartonc
Right-click My Computer and go to Properties. On the Advanced tab, click the Environment Variables button. In the "User variables for <your name>" list, find the one called PATH, select it, and click the Edit button. Add something like

C:\python24;

(depending on the actual path and version of your Python installation)
to the beginning of the line (yours may not have anything in it yet).
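Once you've edited PATH, a quick way to confirm the change took effect is to open a fresh console and list the entries Python sees. This is just a sanity-check sketch using the standard library; the C:\python24 directory is only the example from above:

```python
# Sanity check: print the PATH entries visible to the current process.
# Run this from a newly opened console so the edited PATH is picked up.
import os

entries = os.environ.get("PATH", "").split(os.pathsep)
for entry in entries:
    print(entry)
# If the Python directory you added (e.g. C:\python24) appears here,
# typing "backup_fotopic.py" in that console should find the interpreter.
```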
Aug 23 '07 #4
Got the PATH variable set, thanks.

When I run the script I get:
Traceback (most recent call last):
  File "C:/Python25/backup_fotopic", line 30, in <module>
    title = image_soup.first('title').contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'
How do I fix this?

Thank you for your continued support.
Aug 24 '07 #5
bartonc
This is telling you that the result of

image_soup.first('title')

is None.
My guess is that 'title' is meaningless to the function.

You'll need to include some code and an explanation of what you expect it to do.
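To make the failure mode concrete, here is a minimal sketch; `title_tag` is just a stand-in for whatever `image_soup.first('title')` returned, not part of the script itself:

```python
# Reproducing the error: None has no .contents attribute.
title_tag = None  # stand-in for image_soup.first('title') on a bad page
try:
    title = title_tag.contents[0]
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'contents'

# A defensive pattern: check for None before indexing.
title = title_tag.contents[0] if title_tag is not None else "untitled"
print(title)  # untitled
```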
Aug 24 '07 #6
  1. #! /usr/bin/env python
  2.  
  3. import urllib, string
  4. from BeautifulSoup import BeautifulSoup
  5.  
  6. collections_soup = BeautifulSoup()
  7.  
  8. # Replace the example URL below with the address of the pictures you want to backup
  9. base_url = 'http://andiday.fotopic.net/c1343336.html'
  10.  
  11. f = urllib.urlopen(base_url + '/list_collections.php')
  12. result = f.read()
  13. f.close()
  14. collections_soup.feed(result)
  15. for collection in collections_soup('a'):
  16.     print '>>>' + base_url + collection['href']
  17.     f = urllib.urlopen(base_url + collection['href'])
  18.     result = f.read()
  19.     f.close()
  20.     collection_soup = BeautifulSoup()
  21.     collection_soup.feed(result)
  22.     for thumb in collection_soup('td', {'class' : 'thumbs'}):
  23.         for image in thumb('a'):
  24.             if string.find(image['href'], 'javascript') == -1 and string.find(image['href'], 'title') == -1:
  25.                 f = urllib.urlopen(base_url + image['href'])
  26.                 result = f.read()
  27.                 f.close()
  28.                 image_soup = BeautifulSoup()
  29.                 image_soup.feed(result)
  30.                 title = image_soup.first('title').contents[0]
  31.                 filename = string.split(title.string, '.JPG')[0]
  32.                 print filename
  33.                 for photo_div in image_soup('div', {'class' : 'photo-image'}):
  34.                     for img in photo_div('img'):
  35.                         print img
  36.                         print filename
  37.                         f = urllib.urlopen(img['src'])
  38.                         result = f.read()
  39.                         f.close()
  40.  
  41.             # Replace the path below with a folder on your hard drive
  42.                         img = open('C:\\Downloads\\' + filename + '.JPG', 'wb+')
  43.                         img.write(result)
  44.                         img.close()
  45.  
The script is designed to scrape the URL (in this case http://andiday.fotopic.net/c1343336.html) and download all the .jpg files from it to a folder on the hard disk (in this case, C:\Downloads).
Aug 24 '07 #7
bartonc
When debugging/troubleshooting, always start at the source.
I'm not too internet-savvy, but I'm pretty sure that this says something about "couldn't find the file":
>>> import urllib
>>> f = urllib.urlopen('http://andiday.fotopic.net/c1343336.html/list_collections.php')
>>> f.read()
'\n\n<!-- /export/fotopic.net/userland/www/css/18102.css -->\n\n<link rel="stylesheet" href="http://media.fotopic.net/virtualv1/25/style.css" type="text/css">\n<body>\n\n<table border=0 cellpadding=4 cellspacing=2 width=100%>\n<tr><td class="content">\n<center><h2><div class="photo">404: Page Not found</h2></center>\n<div class="photo">We\'re sorry but we couldn\'t find the file you requested:\n<ul>\n<strong><div class="photo">http://andiday.fotopic.net/c1343336.html/list_collections.php</strong>\n</ul>\n<div class="photo">So either it doesn\'t exist, or it\'s been moved.\n<p>\n\n<div class="photo">If you\'re looking for a particular person\'s gallery, you could try looking at<br/>\nour <a href="http://fotopic.net/community/">Community</a> section, or\nalternatively take a look at <a href="http://fotopic.net/">the main Fotopic\nsite</a>.\n\n<p>\n\n</td></tr>\n</table>\n\n</body>\n'
>>> f.close()
>>> del f
>>> del urllib
Aug 24 '07 #8
What should I do with this code?
Aug 25 '07 #9
bartonc
Find the address of a valid list_collections.php.
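One thing worth checking (an assumption on my part, since I don't know fotopic's site layout): the failing request went to http://andiday.fotopic.net/c1343336.html/list_collections.php, i.e. the .php path was appended to a page URL rather than to the site root. A sketch of deriving the root first (the module is `urllib.parse` on current Python; in the Python 2 of this thread it was called `urlparse`):

```python
# Sketch: derive the site root from the page URL before appending paths.
from urllib.parse import urlsplit, urlunsplit

base_url = 'http://andiday.fotopic.net/c1343336.html'
parts = urlsplit(base_url)
# Keep only scheme and host; drop the /c1343336.html page path.
site_root = urlunsplit((parts.scheme, parts.netloc, '', '', ''))
print(site_root + '/list_collections.php')
# -> http://andiday.fotopic.net/list_collections.php
```

Whether the server actually serves list_collections.php at the root is still something to verify in a browser first.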
Aug 25 '07 #10
I don't understand :(

Sorry for being so noobish! lol
Aug 26 '07 #11
bartonc
The script will do what you want, but the message from f.read() is telling you that the address that you are using is not valid. It looks like there is information in that message that may help you find a valid address. Sorry to be so vague, but, as I've said, I'm not a web-scraping kind of developer.
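Since that error page comes back as ordinary HTML rather than raising an exception, one option is to bail out when the body looks like fotopic's "Page Not found" page. A sketch, with the marker string taken from the dump above and `body` standing in for the string returned by f.read():

```python
# Sketch: detect fotopic's "Page Not found" body before parsing.
body = '<h2><div class="photo">404: Page Not found</h2>'  # example response text

if '404: Page Not found' in body:
    print('bad address - fix base_url before scraping')
else:
    print('looks OK, hand the page to BeautifulSoup')
```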
Aug 26 '07 #12
