HTMLParser handler_starttag misses lots of tags!

I want to parse an HTML file and extract my router's IP address. I
wrote this code and I have Python 2.3 installed:

#! /usr/bin/env python

import HTMLParser

class HP(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only the tag name is printed
        print "tag is %s." % (tag)

    def handle_comment(self, data):
        print "caught a comment: %s." % (data)

    def handle_data(self, data):
        if "IP" in data:
            print "Caught %s." % data

hp = HP()
out = open('routerstatus.html')
for line in out:
    hp.feed(line)

I figured that when I ran this on the HTML code at the bottom of this
post, it would print every tag, but instead, this is what I got:

tag is html.
tag is head.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is title.
tag is link.
tag is script.
tag is body.
tag is form.

The program seems to take a vacation after the opening form tag. What
am I doing wrong?

Finally, this is the html code I am trying to parse:

<html>

<head>
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
<meta name="generator" content="Adobe GoLive 5">
<META http-equiv='Pragma' CONTENT='no-cache'>
<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
<META http-equiv='Refresh' CONTENT='20'>
<title>router form</title>
<link rel="stylesheet" href="form.css">
<script language="javascript" type="text/javascript">
<!-- hide script from old browsers
function loadhelp(num) {

parent.helpframe.document.location.href="help/help"+num+".html"

}
function newwindow(F)
{
if((F.status.value =="checked")||(F.EncapPTelstra.value=="checked")|| (F.EncapAolDhcp.value=="checked"))
window.open('enatherstatus.htm', 'enstatherstatus', 'width=380,height=450,status=yes');
else if((F.EncapPPTP.value =="checked"))
window.open('pptpstatus.htm', 'pptpstatus', 'width=380,height=320,status=yes');
else
window.open('pppoestatus.htm', 'pppoestatus', 'width=380,height=320,status=yes');

}
//-->
</script>
</head>
<body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="loadhelp('_SysStatus')">
<form method="POST">
<input type=hidden name=status value=>
<input type="hidden" name=EncapPTelstra value=>
<input type="hidden" name=EncapPPTP value=>
<input type="hidden" name=EncapAolDhcp value=>
<table border="0" cellpadding="0" cellspacing="3" width="100%">
<tr>
<td colspan="2">
<h1>Router Status</h1>
</td>
</tr>
<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td width="60%">
<b>Account Name</b>
</td>
<td width="40%">

</td>
</tr>

<tr>
<td width="60%">
<b>Firmware Version </b>
</td>
<td width="40%">
4.13 Aug 20 2003
</td>
</tr>

<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td colspan="2">
<span class="subhead">Internet Port </span>
</td>
</tr>
<tr>
<td width="60%">
<b>MAC Address </b>
</td>
<td width="40%">
00:09:5b:29:3d:b4
</td>
</tr>
<tr>
<td width="60%">
<b>IP Address </b>
</td>
<td width="40%">
66.72.206.129
</td>
</tr>
<tr>
<td width="60%">
<b>DHCP </b>
</td>
<td width="40%">
None
</td>
</tr>
<tr>
<td width="60%">
<b>IP Subnet Mask </b>
</td>
<td width="40%">
None
</td>
</tr>
<tr>
<td width="60%">
<b>Domain Name Server</b>
</td>
<td width="40%">
66.73.20.40
</td>
</tr>
<tr>
<td width="60%">
<b></b>
</td>
<td width="40%">
206.141.193.55
</td>
</tr>
<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td colspan="2">
<span class="subhead">LAN Port </span>
</td>
</tr>
<tr>
<td width="60%">
<b>MAC Address </b>
</td>
<td width="40%">
00:09:5b:29:3d:b3
</td>
</tr>
<tr>
<td width="60%">
<b>IP Address </b>
</td>
<td width="40%">
192.168.0.1
</td>
</tr>
<tr>
<td width="60%">
<b>DHCP </b>
</td>
<td width="40%">
Server
</td>
</tr>
<tr>
<td width="60%">
<b>IP Subnet Mask </b>
</td>
<td width="40%">
255.255.255.0
</td>
</tr>

</table>
<TABLE border=0 width="100%">
<tr width="100%">
<td>
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<TR width="100%">
<TD>

<span class="subhead">Wireless Port </span>
</TD>
</TR>

</TABLE>

<TABLE width="100%" border=0>

<TR>
<TD width="60%"><b>MAC Address
(BSSID) </b></TD>
<TD width="40%">00:09:5b:29:3d:b3</TD></TR>
</table>
<TABLE width="100%" cellSpacing=2 border=0>
<TD width="60%"><b>Name (SSID)</b></TD>
<TD width="40%">natchieland</TD></tr>
<TD width="60%"><b>Region</b></TD>
<TD width="40%">USA</TD></tr>
<TD width="60%"><b>Channel</b></TD>
<TD width="40%">1</TD></tr>

</table>
<TABLE width="100%" cellSpacing=2 border=0>

<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>

<tr>
<td align='center'>
<input type="BUTTON" value="Show Statistics" onclick="window.open('mtenSysStatistics.htm','static','width=500,height=200,status=yes, resizable=yes');">
<INPUT onclick="newwindow(this.form);" type=button value="Connection Status">
</TD>
</tr>
</TABLE>
</form>
</body>

</html>
Jul 18 '05 #1
> The program seems to take a vacation after the opening form tag. What
> am I doing wrong?

> <input type=hidden name=status value=>

I can't believe that this value=-thingy is valid HTML...

Regards,

Diez
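
The empty, unquoted value= attribute is the likely trigger. A plausible reading of Python 2.3's HTMLParser is that it treats `value=>` as a start tag whose attribute value has not arrived yet, so it keeps buffering input instead of reporting anything. One possible workaround is to quote such empty values before parsing. The sketch below is written for Python 3's `html.parser` (the successor to the `HTMLParser` module, which is more lenient on its own); the names `EMPTY_VALUE`, `quote_empty_values`, and `TagCollector` are made up for illustration, and the regex is deliberately crude: it runs over the whole text, scripts included.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper pattern: an attribute written as name= with nothing
# before the next whitespace or '>' (for example: value=>).
EMPTY_VALUE = re.compile(r'(\s[A-Za-z_][-.:\w]*)=(?=[\s>])')


def quote_empty_values(html_text):
    """Rewrite name= (no value) as name="" so stricter parsers accept the tag."""
    return EMPTY_VALUE.sub(r'\1=""', html_text)


class TagCollector(HTMLParser):
    """Records every start tag name in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# A short excerpt modelled on the router page from the question.
sample = (
    '<form method="POST">\n'
    '<input type=hidden name=status value=>\n'
    '<input type="hidden" name=EncapPPTP value=>\n'
    '<table border="0"><tr><td>192.168.0.1</td></tr></table>\n'
    '</form>\n'
)

cleaned = quote_empty_values(sample)
collector = TagCollector()
collector.feed(cleaned)
collector.close()

print(cleaned)
print(collector.tags)
```

With this sample, the collected tags continue past the form tag: form, input, input, table, tr, td.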

Jul 18 '05 #2
Matthew Wilson wrote:
> [...]
> The program seems to take a vacation after the opening form tag. What
> am I doing wrong?


Nothing, but your input file is not valid HTML and seems to puzzle the
parser. I recommend running it through tidy before you feed it to the
parser.
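
A rough sketch of that tidy step, driven from Python 3, might look like the following. It assumes the HTML Tidy command-line tool is installed and on the PATH as `tidy`; the function name `tidy_html` is hypothetical, and the exit-code handling (0 for clean input, 1 for warnings only, 2 for errors) is my understanding of the tool rather than something stated in this thread. If tidy is missing or reports errors, the input is returned unchanged.

```python
import shutil
import subprocess


def tidy_html(html_text):
    """Return html_text cleaned by HTML Tidy, or unchanged if tidy cannot help."""
    if shutil.which("tidy") is None:
        # tidy is not installed (assumption: the binary is named "tidy")
        return html_text
    result = subprocess.run(
        ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
        input=html_text,
        capture_output=True,
        text=True,
    )
    # Assumed exit codes: 0 = clean, 1 = warnings only, 2 = errors.
    if result.returncode in (0, 1) and result.stdout:
        return result.stdout
    return html_text


page = (
    "<html><head><title>router form</title></head>"
    "<body><p>192.168.0.1</p></body></html>"
)
cleaned = tidy_html(page)
print("192.168.0.1" in cleaned)
```

The cleaned text can then be passed to the parser's feed() in one call instead of line by line.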

Peter
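
Once the markup parses, the original goal (pulling out the router's IP address) can be approached by pairing each "IP Address" label cell with the cell that follows it. This is only a sketch for Python 3's `html.parser`, not the code from the thread: the class name `IPAddressFinder` and the `IPV4` pattern are hypothetical, and it assumes the label and the value sit in consecutive `<td>` cells, as they do in the posted page.

```python
import re
from html.parser import HTMLParser

# Loose dotted-quad check; it does not validate that each octet is <= 255.
IPV4 = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}$')


class IPAddressFinder(HTMLParser):
    """Collects the text of the <td> that follows each 'IP Address' label cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_text = []
        self.expect_value = False
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self.in_cell:
            return
        self.in_cell = False
        # Collapse runs of whitespace so "IP Address " matches "IP Address".
        text = " ".join("".join(self.cell_text).split())
        if self.expect_value:
            self.expect_value = False
            if IPV4.match(text):
                self.addresses.append(text)
        elif text == "IP Address":
            self.expect_value = True


# Rows modelled on the Internet Port and LAN Port sections of the posted page.
sample = """
<table>
<tr><td width="60%"><b>MAC Address </b></td><td width="40%">00:09:5b:29:3d:b4</td></tr>
<tr><td width="60%"><b>IP Address </b></td><td width="40%">66.72.206.129</td></tr>
<tr><td width="60%"><b>IP Subnet Mask </b></td><td width="40%">None</td></tr>
<tr><td width="60%"><b>IP Address </b></td><td width="40%">192.168.0.1</td></tr>
</table>
"""

finder = IPAddressFinder()
finder.feed(sample)
finder.close()
print(finder.addresses)
```

For these rows the list holds the Internet port address first and the LAN port address second, in document order.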
Jul 18 '05 #3
