HTMLParser error

jonbutler88

I'm writing a simple website spider in Python and keep getting the
error below; I'm not sure what to do. The problem seems to be in the
feed() method of HTMLParser.

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: Spider instance has no attribute 'rawdata'

Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.

Jun 27 '08 #1

jonbutler88

On May 21, 9:53 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:

On Wed, 21 May 2008 01:18:00 -0700 (PDT), jonbutle...@googlemail.com
declaimed the following in comp.lang.python:

Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.

In the absence of minimal runnable code reproducing the error
message...

Did you remember to INITIALIZE the attribute to a null value
somewhere prior to that statement?

It's not a variable I set; it's one of HTMLParser's built-in attributes. I
am using it with urlopen to get the source of a website and feed it to
HTMLParser.

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

This is the code I am using. I have tested the other modules and they
work fine, but I haven't got a clue how to fix this one.

Jun 27 '08 #2

alex23

You're not providing enough information. Try to post a minimal code
fragment that demonstrates your error; it gives us all a common basis
for discussion.

Is your Spider class a subclass of HTMLParser? Is it over-riding
__init__? If so, is it doing something like:

super(Spider, self).__init__()

If this is your issue, looking at the HTMLParser code you could get
away with just doing the following in __init__:

self.reset()

This appears to be the function that adds the .rawdata attribute.

Ideally, you should use the former super() syntax...you're less
reliant on the implementation of HTMLParser that way.

- alex23

Jun 27 '08 #3

alex23

On May 22, 8:18 am, jonbutle...@googlemail.com wrote:

Sorry, I'm new to both Python and newsgroups; this is all pretty
confusing. So I need a line in the __init__ function of my class? The
Spider class I made inherits from HTMLParser. It's just using the
feed() function that produces errors though; the rest seems to work
fine.

Let me repeat: it would make this a lot easier if you would paste
actual code.

As you say, your Spider class inherits from HTMLParser, so you need to
make sure that you set it up correctly so that the HTMLParser
functionality you've inherited will work correctly (or work as you
want it to work). If you've added your own __init__ to Spider, then
the __init__ on HTMLParser is no longer called unless you *explicitly*
call it yourself.

Unfortunately, my earlier advice wasn't totally correct... HTMLParser
is an old-style object, whereas super() only works for new-style
objects, I believe. (If you don't know about old- v new-style objects,
see http://docs.python.org/ref/node33.html). So there are a couple of
approaches that should work for you:

    from HTMLParser import HTMLParser

    class SpiderBroken(HTMLParser):
        def __init__(self):
            pass # don't do any ancestral setup

    class SpiderOldStyle(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

    class SpiderNewStyle(HTMLParser, object):
        def __init__(self):
            super(SpiderNewStyle, self).__init__()

Python 2.5.1 (r251:54863, May 1 2007, 17:47:05) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = open('temp.html','r').read()
>>> from spider import *
>>> sb = SpiderBroken()
>>> sb.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: SpiderBroken instance has no attribute 'rawdata'
>>> so = SpiderOldStyle()
>>> so.feed(html)
>>> sn = SpiderNewStyle()
>>> sn.feed(html)

The old-style version is probably easiest, so putting this line in
your __init__ should fix your issue:

HTMLParser.__init__(self)
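
For anyone reading this on Python 3, the same mistake and fix can be sketched roughly as follows. This assumes Python 3, where the module is `html.parser` and every class is new-style, so `super()` works without the extra `object` base. The names `BrokenSpider`, `FixedSpider`, `feed_fails` and the `links` attribute are made up for this illustration and are not from the original spider.

```python
from html.parser import HTMLParser


class BrokenSpider(HTMLParser):
    def __init__(self):
        # The parent __init__ is never called, so reset() never runs
        # and the instance has no 'rawdata' attribute.
        self.links = []


class FixedSpider(HTMLParser):
    def __init__(self):
        # Let HTMLParser set up its own state first, then add ours.
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; href may be anywhere in it.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def feed_fails(parser, html):
    """Return True if feeding html raises AttributeError."""
    try:
        parser.feed(html)
    except AttributeError:
        return True
    return False


page = '<p><a class="x" href="http://example.com/">link</a></p>'
broken_failed = feed_fails(BrokenSpider(), page)  # the unfixed class fails
fixed = FixedSpider()
fixed.feed(page)                                  # the fixed class collects links
```

The only difference between the two classes that matters here is whether the parent initialiser runs before `feed()` is called.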

If this still isn't clear, please let me know.

- alex23

Jun 27 '08 #4

jonbutler88

OK, here's what I have so far:

    #!/usr/bin/env python
    from HTMLParser import HTMLParser
    from urllib2 import urlopen, HTTPError

    class Spider(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.found = []
            self.queue = []

        def handle_starttag(self, tag, attrs):
            try:
                if tag == 'a':
                    if attrs[0][0] == 'href':
                        self.queue.append(attrs[0][1])
            except HTMLParseError:
                print 'Error parsing HTML tags'

        def parse(self, page):
            try:
                self.feed(urlopen('http://' + page).read())
            except HTTPError:
                print 'Error getting page source'

        def crawl(self, site):
            self.queue.append(site)
            while 1:
                try:
                    url = self.queue.pop(0)
                    self.parse(url)
                except IndexError:
                    break
                self.found.append(url)
            return self.found

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        s.crawl(site)

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1064, in do_open
    h = http_class(host) # will parse host:port
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 639, in __init__
    self._set_hostport(host, port)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 651, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

Thanks

Jun 27 '08 #5

alex23

On May 22, 6:22 pm, jonbutle...@googlemail.com wrote:

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
[...snip...]
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:

    >python spider.py
    What site would you like to scan? http://www.google.com
    http://www.google.com
    http://http://images.google.com.au/imghp?hl=en&tab=wi

The links you're finding on each page already have the protocol
specified. I'd remove the 'http://' addition from parse, and just add
it to 'site' in the main section.

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)
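
To illustrate why the doubled prefix produces a "nonnumeric port" error, and one possible way to handle links in general, here is a rough Python 3 sketch using `urllib.parse`. The helper name `resolve_link` and the relative path `/intl/en/about.html` are invented for the example, and handling relative links is an assumption beyond what was discussed above.

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    # urljoin leaves absolute hrefs alone and resolves relative ones
    # against the page they were found on.
    return urljoin(page_url, href)


# What the original parse() effectively built for an absolute link:
doubled = 'http://' + 'http://images.google.com.au/imghp?hl=en&tab=wi'
# The host part comes out as 'http:', i.e. a host called 'http'
# followed by an empty port, which is what the error complains about.
doubled_netloc = urlsplit(doubled).netloc

base = 'http://www.google.com/'
absolute = resolve_link(base, 'http://images.google.com.au/imghp?hl=en&tab=wi')
relative = resolve_link(base, '/intl/en/about.html')  # hypothetical path
```

With this approach the spider never needs to prepend `'http://'` itself once it has a full starting URL.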

Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

You have overridden __init__, and that's fine: defining Spider.__init__
replaces the one Spider would otherwise inherit from HTMLParser, which is
why it is no longer called automatically. With the explicit call, every
time you create a Spider object you first get HTMLParser to initialise it
as it would any other HTMLParser object - which is what adds the .rawdata
attribute to each HTMLParser instance - *and then* do the
Spider-specific initialisation you need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):
        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance. Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, it calls its reset
method, which sets rawdata to an empty string, or adds it to the
instance if it doesn't already exist. So when you call
HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
method on the Spider instance, which it inherits from HTMLParser...
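
The same chain of calls can be seen in a stripped-down toy version. This is only a sketch of the pattern, not the real HTMLParser; `Base` and `Child` are stand-in names made up for the illustration, written for Python 3.

```python
class Base:
    def __init__(self):
        # Mirrors HTMLParser.__init__: all setup is delegated to reset().
        self.reset()

    def reset(self):
        # Creates the attribute if it is missing, or clears it if present.
        self.rawdata = ''

    def feed(self, data):
        # Fails with AttributeError if reset() has never run.
        self.rawdata = self.rawdata + data


class Child(Base):
    def __init__(self):
        Base.__init__(self)  # runs the inherited reset() on this Child instance
        self.queue = []      # then does the Child-specific setup


c = Child()
c.feed('<html>')
```

Because `self` is the Child instance throughout, the attribute created inside `Base.reset()` ends up on the Child object, which is why `feed()` works afterwards.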

Are you familiar with object oriented design at all? If you're not,
let me know and I'll track down some decent intro docs. Inheritance is
a pretty fundamental concept but I don't think I'm doing it justice.

Jun 27 '08 #6

jonbutler88

Nope, this is my first experience with object-oriented programming. I've
only been learning Python for a few weeks, but it seemed simple enough
to inspire me to be a bit ambitious. If you could hook me up with some
good docs that would be great. I was about to buy a book on Python,
specifically OO-based, but I'll look at these docs first. I understand
most of the concepts of inheritance; I've just never used them before.

Thanks

Jun 27 '08 #7

alex23

Ah, okay, I'm really sorry, if I'd known I would've tried to explain
things a little differently :)

Mark Pilgrim's Dive Into Python is a really good place to start:
http://www.diveintopython.org/toc/index.html

For a quick overview of object oriented programming in Python, try:
http://www.freenetpages.co.uk/hp/alan.gauld/
Specifically: http://www.freenetpages.co.uk/hp/ala...d/tutclass.htm

But don't hesitate to ask questions here or even contact me privately
if you'd prefer.

Jun 27 '08 #8

jjbutler88

Thanks for the help, and sorry for the delayed reply; I flew out to
Detroit yesterday and the wifi here is rubbish. I will definitely get
reading Dive Into Python, and the other article cleared a lot up for me.
Hopefully I won't have these errors any more; if I keep getting them
I'll get in touch.

Cheers

Jun 27 '08 #9
