HTMLParser error

I'm writing a simple website spider in Python and keep getting these
errors; I'm not sure what to do. The problem seems to be in the feed()
method of HTMLParser.

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: Spider instance has no attribute 'rawdata'

Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.
Jun 27 '08 #1
8 Replies
On May 21, 9:53 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Wed, 21 May 2008 01:18:00 -0700 (PDT), jonbutle...@googlemail.com
declaimed the following in comp.lang.python:
Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.

        In the absence of minimal runnable code reproducing the error
message...

        Did you remember to INITIALIZE the attribute to a null value
somewhere prior to that statement?
--
        Wulfraed        Dennis Lee Bieber               KD6MOG
        wlfr...@ix.netcom.com              wulfr...@bestiaria.com
                HTTP://wlfraed.home.netcom.com/
        (Bestiaria Support Staff:               web-a...@bestiaria.com)
                HTTP://www.bestiaria.com/
It's not a variable I set; it's one of HTMLParser's built-in attributes.
I am using it with urlopen to get the source of a website and feed it to
HTMLParser.

def parse(self, page):
    try:
        self.feed(urlopen('http://' + page).read())
    except HTTPError:
        print 'Error getting page source'

This is the code I am using. I have tested the other modules and they
work fine, but I haven't got a clue how to fix this one.
Jun 27 '08 #2
On May 21, 6:58 pm, jonbutle...@googlemail.com wrote:
[...snip...]
You're not providing enough information. Try to post a minimal code
fragment that demonstrates your error; it gives us all a common basis
for discussion.

Is your Spider class a subclass of HTMLParser? Is it over-riding
__init__? If so, is it doing something like:

super(Spider, self).__init__()

If this is your issue, looking at the HTMLParser code you could get
away with just doing the following in __init__:

self.reset()

This appears to be the function that adds the .rawdata attribute.

Ideally, you should use the former super() syntax...you're less
reliant on the implementation of HTMLParser that way.

- alex23
Jun 27 '08 #3
On May 22, 8:18 am, jonbutle...@googlemail.com wrote:
Sorry, I'm new to both Python and newsgroups; this is all pretty
confusing. So I need a line in the __init__ method of my class? The
Spider class I made inherits from HTMLParser. It's just using the
feed() method that produces errors though; the rest seems to work
fine.
Let me repeat: it would make this a lot easier if you would paste
actual code.

As you say, your Spider class inherits from HTMLParser, so you need to
make sure that you set it up correctly so that the HTMLParser
functionality you've inherited will work correctly (or work as you
want it to work). If you've added your own __init__ to Spider, then
the __init__ on HTMLParser is no longer called unless you *explicitly*
call it yourself.

Unfortunately, my earlier advice wasn't totally correct... HTMLParser
is an old-style class, whereas super() only works with new-style
classes, I believe. (If you don't know about old- vs new-style classes,
see http://docs.python.org/ref/node33.html). So there are a couple of
approaches that should work for you:

    class SpiderBroken(HTMLParser):
        def __init__(self):
            pass # don't do any ancestral setup

    class SpiderOldStyle(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

    class SpiderNewStyle(HTMLParser, object):
        def __init__(self):
            super(SpiderNewStyle, self).__init__()

Python 2.5.1 (r251:54863, May 1 2007, 17:47:05) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = open('temp.html','r').read()
>>> from spider import *
>>> sb = SpiderBroken()
>>> sb.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: SpiderBroken instance has no attribute 'rawdata'
>>> so = SpiderOldStyle()
>>> so.feed(html)
>>> sn = SpiderNewStyle()
>>> sn.feed(html)

The old-style version is probably easiest, so putting this line in
your __init__ should fix your issue:

HTMLParser.__init__(self)
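
For anyone following along on a current interpreter, here is a minimal sketch of the same fix under Python 3. The assumptions: the module is html.parser rather than HTMLParser, the old-/new-style distinction no longer applies there (so super().__init__() should work equally well), and the links attribute and the sample HTML are made up purely for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; Python 2.5 uses "from HTMLParser import HTMLParser"


class SpiderBroken(HTMLParser):
    def __init__(self):
        pass  # parent __init__ never runs, so self.rawdata is never created


class SpiderFixed(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # explicit parent call, as recommended above
        self.links = []            # hypothetical subclass state, set up afterwards

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; look up href by name
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


html = '<a class="nav" href="http://example.com/">home</a>'

# The broken variant fails as soon as feed() touches self.rawdata
try:
    SpiderBroken().feed(html)
    broken_error = None
except AttributeError as exc:
    broken_error = exc

# The fixed variant parses normally and records the link
fixed = SpiderFixed()
fixed.feed(html)
```

The order inside SpiderFixed.__init__ is the point: let the parent set itself up first, then add whatever the subclass needs.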

If this still isn't clear, please let me know.

- alex23
Jun 27 '08 #4
On May 22, 2:40 am, alex23 <wuwe...@gmail.com> wrote:
[...snip...]
OK, here's what I have so far:

#!/usr/bin/env python
from HTMLParser import HTMLParser
from urllib2 import urlopen, HTTPError

class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []
        self.queue = []

    def handle_starttag(self, tag, attrs):
        try:
            if tag == 'a':
                if attrs[0][0] == 'href':
                    self.queue.append(attrs[0][1])
        except HTMLParseError:
            print 'Error parsing HTML tags'

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

    def crawl(self, site):
        self.queue.append(site)
        while 1:
            try:
                url = self.queue.pop(0)
                self.parse(url)
            except IndexError:
                break
            self.found.append(url)
        return self.found

if __name__ == '__main__':
    s = Spider()
    site = raw_input("What site would you like to scan? http://")
    s.crawl(site)

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1064, in do_open
    h = http_class(host) # will parse host:port
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 639, in __init__
    self._set_hostport(host, port)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 651, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

Thanks
Jun 27 '08 #5
On May 22, 6:22 pm, jonbutle...@googlemail.com wrote:
Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
[...snip...]
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:
>python spider.py
What site would you like to scan? http://www.google.com
http://www.google.com
http://http://images.google.com.au/imghp?hl=en&tab=wi

The links you're finding on each page already have the protocol
specified. I'd remove the 'http://' addition from parse, and just add
it to 'site' in the main section.

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)
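
If you would rather not depend on where a URL came from, another option is to add the scheme only when it is missing. This is just a sketch under a few assumptions: it is written for Python 3's urllib.parse (in Python 2.5 the same function lives in the urlparse module), and ensure_scheme is a hypothetical helper name, not something from the spider above.

```python
from urllib.parse import urlsplit  # Python 2.5 equivalent: from urlparse import urlsplit


def ensure_scheme(url, default='http'):
    """Return url unchanged if it already names a scheme, else prefix one."""
    if urlsplit(url).scheme:
        return url
    return default + '://' + url


# The two kinds of string seen in the debug output above
typed = ensure_scheme('www.google.com')
scraped = ensure_scheme('http://images.google.com.au/imghp?hl=en&tab=wi')
```

With a helper like this, the bare hostname typed at the prompt gains a prefix while links scraped from a page pass through untouched, so 'http://http://...' can no longer be produced.
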
Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?
Overriding __init__ in Spider is fine, as long as the parent's __init__
still runs. What you're doing every time you create a Spider object is
first getting HTMLParser to initialise it as it would any other
HTMLParser object - which is what adds the .rawdata attribute to each
HTMLParser instance - *and then* doing the Spider-specific
initialisation you need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):
        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance.  Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, it calls its reset
method, which sets rawdata to an empty string, creating the attribute
on the instance if it doesn't already exist. So when you call
HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
method on the Spider instance, which it inherits from HTMLParser...
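
The same mechanism can be seen without HTMLParser at all. Below is a toy sketch in Python 3 syntax; Base, Child and Forgetful are made-up stand-ins, with Base playing the role of HTMLParser, so treat it as an illustration rather than the real class.

```python
class Base:
    """Stand-in for HTMLParser: __init__ only delegates to reset()."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.rawdata = ''  # created on whichever instance self refers to


class Child(Base):
    def __init__(self):
        Base.__init__(self)  # self is the Child instance being built
        self.queue = []      # Child-specific setup runs afterwards


class Forgetful(Base):
    def __init__(self):
        self.queue = []      # Base.__init__ is never called here


child = Child()
forgetful = Forgetful()

# vars() lists the attributes stored on each instance
child_attrs = vars(child)
forgetful_attrs = vars(forgetful)
```

Child ends up with both rawdata and queue, while Forgetful only has queue, which is the situation that produced the original AttributeError.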

Are you familiar with object oriented design at all? If you're not,
let me know and I'll track down some decent intro docs. Inheritance is
a pretty fundamental concept but I don't think I'm doing it justice.
Jun 27 '08 #6
On May 22, 9:59 am, alex23 <wuwe...@gmail.com> wrote:
[...snip...]
Nope, this is my first experience with object-oriented programming.
I've only been learning Python for a few weeks, but it seemed simple
enough to inspire me to be a bit ambitious. If you could hook me up
with some good docs that would be great. I was about to buy a book on
Python, specifically OO-based, but I'll look at these docs first. I
understand most of the concepts of inheritance; I've just never used
them before.

Thanks
Jun 27 '08 #7
On May 23, 5:06 am, jonbutle...@googlemail.com wrote:
[...snip...]
Ah, okay, I'm really sorry, if I'd known I would've tried to explain
things a little differently :)

Mark Pilgrim's Dive Into Python is a really good place to start:
http://www.diveintopython.org/toc/index.html

For a quick overview of object oriented programming in Python, try:
http://www.freenetpages.co.uk/hp/alan.gauld/
Specifically: http://www.freenetpages.co.uk/hp/ala...d/tutclass.htm

But don't hesitate to ask questions here or even contact me privately
if you'd prefer.
Jun 27 '08 #8
On May 22, 8:20 pm, alex23 <wuwe...@gmail.com> wrote:
[...snip...]

Thanks for the help, and sorry for the delayed reply; I flew out to
Detroit yesterday and the wifi here is rubbish. I will definitely get
reading Dive Into Python, and the other article cleared a lot up for
me. Hopefully I won't have these errors any more; if I keep getting
them I'll get in touch.

Cheers
Jun 27 '08 #9
