HTMLParser error

I'm writing a simple website spider in Python and keep getting this
error; I'm not sure what to do. The problem seems to be in the feed()
method of HTMLParser.

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: Spider instance has no attribute 'rawdata'

Any ideas on how to fix this? I'm using Python 2.5.2 on Mac OS X.
Jun 27 '08 #1
On May 21, 9:53 am, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Wed, 21 May 2008 01:18:00 -0700 (PDT), jonbutle...@googlemail.com
declaimed the following in comp.lang.python:
Any ideas of how to fix this? Im using python 2.5.2 on mac osx

In the absence of minimal runnable code reproducing the error
message...

Did you remember to INITIALIZE the attribute to a null value
somewhere prior to that statement?
--
Wulfraed    Dennis Lee Bieber    KD6MOG
wlfr...@ix.netcom.com    wulfr...@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-a...@bestiaria.com)
HTTP://www.bestiaria.com/

It's not a variable I set, it's one of HTMLParser's built-in attributes. I
am using it with urlopen to get the source of a website and feed it to
HTMLParser.

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

This is the code I am using. I have tested the other modules and they
work fine, but I haven't got a clue how to fix this one.
Jun 27 '08 #2
On May 21, 6:58 pm, jonbutle...@googlemail.com wrote:
[...snip...]

You're not providing enough information. Try to post a minimal code
fragment that demonstrates your error; it gives us all a common basis
for discussion.

Is your Spider class a subclass of HTMLParser? Is it overriding
__init__? If so, is it calling the parent's __init__ with something like:

    super(Spider, self).__init__()

If it isn't, that is probably your issue. Looking at the HTMLParser code,
you could get away with just doing the following in __init__:

    self.reset()

This appears to be the method that adds the .rawdata attribute.

Ideally, you should use the former super() syntax... you're less
reliant on the implementation of HTMLParser that way.

- alex23
Jun 27 '08 #3
On May 22, 8:18 am, jonbutle...@googlemail.com wrote:
Sorry, I'm new to both Python and newsgroups, this is all pretty
confusing. So I need a line in the __init__ function of my class? The
spider class I made inherits from HTMLParser. It's just using the
feed() function that produces errors though, the rest seems to work
fine.
Let me repeat: it would make this a lot easier if you would paste
actual code.

As you say, your Spider class inherits from HTMLParser, so you need to
make sure that you set it up correctly so that the HTMLParser
functionality you've inherited will work correctly (or work as you
want it to work). If you've added your own __init__ to Spider, then
the __init__ on HTMLParser is no longer called unless you *explicitly*
call it yourself.

Unfortunately, my earlier advice wasn't totally correct... HTMLParser
is an old-style object, whereas super() only works for new-style
objects, I believe. (If you don't know about old- v new-style objects,
see http://docs.python.org/ref/node33.html). So there are a couple of
approaches that should work for you:

    class SpiderBroken(HTMLParser):
        def __init__(self):
            pass # don't do any ancestral setup

    class SpiderOldStyle(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

    class SpiderNewStyle(HTMLParser, object):
        def __init__(self):
            super(SpiderNewStyle, self).__init__()

Python 2.5.1 (r251:54863, May 1 2007, 17:47:05) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> html = open('temp.html','r').read()
>>> from spider import *
>>> sb = SpiderBroken()
>>> sb.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python25\lib\HTMLParser.py", line 107, in feed
    self.rawdata = self.rawdata + data
AttributeError: SpiderBroken instance has no attribute 'rawdata'
>>> so = SpiderOldStyle()
>>> so.feed(html)
>>> sn = SpiderNewStyle()
>>> sn.feed(html)

The old-style version is probably easiest, so putting this line in
your __init__ should fix your issue:

    HTMLParser.__init__(self)
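
For anyone reading this on Python 3, the same failure and fix can be sketched roughly as follows. This assumes the Python 3 module name html.parser (the code in this thread uses the Python 2 HTMLParser module), where every class is new-style and super() works without the extra object base. The class names and the sample HTML string are made up for illustration, and the attribute loop is just one way of finding href when it is not the first attribute.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class SpiderBroken(HTMLParser):
    def __init__(self):
        # The parent __init__ is never called, so reset() never runs
        # and the instance ends up without a .rawdata attribute.
        self.queue = []


class SpiderFixed(HTMLParser):
    def __init__(self):
        # Let HTMLParser set up its own state first ...
        super().__init__()
        # ... then add the Spider-specific attributes.
        self.queue = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; href may be anywhere in it.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.queue.append(value)


# Hypothetical sample input, not taken from the thread.
html = '<p><a class="nav" href="http://example.com/a">a</a></p>'

try:
    SpiderBroken().feed(html)
    broken_error = None
except AttributeError as exc:
    broken_error = exc  # the missing 'rawdata' attribute, as in the traceback

fixed = SpiderFixed()
fixed.feed(html)
fixed.close()
print(fixed.queue)
```

The only difference between the two classes that matters here is the call to the parent __init__; everything else is ordinary subclass setup.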

If this still isn't clear, please let me know.

- alex23
Jun 27 '08 #4
On May 22, 2:40 am, alex23 <wuwe...@gmail.com> wrote:
[...snip...]

OK, here's what I have so far:

#!/usr/bin/env python
from HTMLParser import HTMLParser, HTMLParseError
from urllib2 import urlopen, HTTPError

class Spider(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []
        self.queue = []

    def handle_starttag(self, tag, attrs):
        try:
            if tag == 'a':
                if attrs[0][0] == 'href':
                    self.queue.append(attrs[0][1])
        except HTMLParseError:
            print 'Error parsing HTML tags'

    def parse(self, page):
        try:
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

    def crawl(self, site):
        self.queue.append(site)
        while 1:
            try:
                url = self.queue.pop(0)
                self.parse(url)
            except IndexError:
                break
            self.found.append(url)
        return self.found

if __name__ == '__main__':
    s = Spider()
    site = raw_input("What site would you like to scan? http://")
    s.crawl(site)

Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
    s.crawl(site)
  File "spider.py", line 30, in crawl
    self.parse(url)
  File "spider.py", line 21, in parse
    self.feed(urlopen('http://' + page).read())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib2.py", line 1064, in do_open
    h = http_class(host) # will parse host:port
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 639, in __init__
    self._set_hostport(host, port)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 651, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?

Thanks
Jun 27 '08 #5
On May 22, 6:22 pm, jonbutle...@googlemail.com wrote:
Still getting very odd errors though, this being the latest:

Traceback (most recent call last):
  File "spider.py", line 38, in <module>
[...snip...]
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:

    >python spider.py
    What site would you like to scan? http://www.google.com
    http://www.google.com
    http://http://images.google.com.au/imghp?hl=en&tab=wi

The links you're finding on each page already have the protocol
specified. I'd remove the 'http://' addition from parse, and just add
it to 'site' in the main section.

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)
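
A rough illustration of why the doubled prefix fails, and one possible way to normalise links before queueing them. It is written for Python 3, where these helpers live in urllib.parse (Python 2.5 has urlparse.urljoin and urlparse.urlsplit instead). The function name absolute_link and the relative path '/intl/en/about.html' are hypothetical and not part of the spider above.

```python
from urllib.parse import urljoin, urlsplit  # Python 3 location of these helpers


def absolute_link(base, href):
    """Resolve href against the page it was found on.

    Absolute links come back unchanged; relative ones are joined to base.
    """
    return urljoin(base, href)


# With the extra prefix, everything between '//' and the next '/' is taken
# as the host part, so the "host" becomes 'http:' with an empty port.
doubled = urlsplit('http://' + 'http://images.google.com.au/imghp?hl=en&tab=wi')
print(doubled.netloc)

base = 'http://www.google.com/'
print(absolute_link(base, 'http://images.google.com.au/imghp?hl=en&tab=wi'))
print(absolute_link(base, '/intl/en/about.html'))  # made-up relative link
```

Resolving every href against the page it came from would also cope with relative links, which simply prepending or omitting 'http://' does not.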
Also could you explain why I needed to add that
HTMLParser.__init__(self) line? Does it matter that I have overwritten
the __init__ function of spider?
Defining Spider.__init__ overrides the __init__ that Spider would
otherwise inherit from HTMLParser, and that is fine as long as the
parent's version still gets called. What you're doing every time
you create a Spider object is first getting HTMLParser to initialise it as
it would any other HTMLParser object - which is what adds the .rawdata
attribute to each HTMLParser instance - *and then* doing the Spider-
specific initialisation you need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):
        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance. Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, it calls its reset
method, which sets rawdata to an empty string, or adds it to the
instance if it doesn't already exist. So when you call
HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
method on the Spider instance, which it inherits from HTMLParser...
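
The effect of reset() can be checked directly. The sketch below uses Python 3's html.parser (an assumption; this thread is about Python 2.5) and a hypothetical Bare subclass whose __init__ deliberately skips the parent setup.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class Bare(HTMLParser):
    def __init__(self):
        pass  # deliberately skip HTMLParser.__init__ (and therefore reset())


b = Bare()
before = hasattr(b, 'rawdata')   # reset() has not run yet

b.reset()                        # method inherited from HTMLParser
after = hasattr(b, 'rawdata')    # reset() has now created the attribute

print(before, after, repr(b.rawdata))
```

On Python 3 the parent __init__ also stores other state (for example convert_charrefs), so calling reset() alone is not enough to make feed() work there; calling the parent __init__ remains the safer route.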

Are you familiar with object oriented design at all? If you're not,
let me know and I'll track down some decent intro docs. Inheritance is
a pretty fundamental concept but I don't think I'm doing it justice.
Jun 27 '08 #6
On May 22, 9:59 am, alex23 <wuwe...@gmail.com> wrote:
[...snip...]

Nope, this is my first experience with object oriented programming.
I've only been learning Python for a few weeks, but it seemed simple enough
to inspire me to be a bit ambitious. If you could hook me up with some
good docs that would be great. I was about to buy a book on Python,
specifically OO based, but I'll look at these docs first. I understand
most of the concepts of inheritance, I've just never used them before.

Thanks
Jun 27 '08 #7
On May 23, 5:06 am, jonbutle...@googlemail.com wrote:
[...snip...]

Ah, okay, I'm really sorry, if I'd known I would've tried to explain
things a little differently :)

Mark Pilgrim's Dive Into Python is a really good place to start:
http://www.diveintopython.org/toc/index.html

For a quick overview of object oriented programming in Python, try:
http://www.freenetpages.co.uk/hp/alan.gauld/
Specifically: http://www.freenetpages.co.uk/hp/ala...d/tutclass.htm

But don't hesitate to ask questions here or even contact me privately
if you'd prefer.
Jun 27 '08 #8
On May 22, 8:20 pm, alex23 <wuwe...@gmail.com> wrote:
[...snip...]

Thanks for the help, and sorry for the delayed reply; I flew out to Detroit
yesterday and the wifi here is rubbish. I will definitely get reading
Dive Into Python, and the other article cleared a lot up for me.
Hopefully I won't have these errors any more; if I keep getting them I'll
get in touch.

Cheers
Jun 27 '08 #9
