(htmllib) How to capture text that includes tags?

I'm trying to parse a product catalog written in HTML. Some of the
information I need are attributes of tags (like the product name, which
is in an anchor). Some (like product description) are between tags
(in the case of product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

Drops the quotation marks, resulting in:

This rectangle measures 7 x 3.

I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?

This is the relevant portion of the class I'm using so far:

import htmllib, re

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)

    def start_font(self, attrs):
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if text:
            if re.search("\\.\\s*$", text):
                print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when data was nothing.

    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.

        If the nofill flag is false, whitespace is collapsed to single
        spaces. A call to this method without a preceding call to the
        save_bgn() method will raise a TypeError exception.

        """
        data = self.savedata
        self.savedata = None
        if data:
            if not self.nofill:
                data = ' '.join(data.split())
        return data

Thanks!

Jen

Jul 18 '05 #1
jennyw wrote:
> I'm trying to parse a product catalog written in HTML. [...]
>
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end() only
> includes text that isn't marked up. For example, this product
> description:
>
> <font size="1">
> This rectangle measures 7&quot; x 3&quot;.
> </font>
>
> Drops the quotation marks, resulting in:
>
> This rectangle measures 7 x 3.


And what's the problem? HTML code produced by broken software like
Frontpage often contains unnecessary quotes - why do you want to
preserve this crap?

If you want to escape special characters you can use
xml.sax.saxutils.escape() or just write your own function (escape is
only a two-liner).
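
For illustration, a minimal sketch of that escaping step, assuming the
text has already been extracted with the literal quote characters intact
(the sample string is made up):

import xml.sax.saxutils

text = 'This rectangle measures 7" x 3".'
# escape() handles &, < and > by default; pass an extra map to turn
# the literal quotes back into &quot; entities.
print xml.sax.saxutils.escape(text, {'"': '&quot;'})
# -> This rectangle measures 7&quot; x 3&quot;.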

Mathias
Jul 18 '05 #2
jennyw wrote:
> I'm trying to parse a product catalog written in HTML. Some of the
> information I need are attributes of tags (like the product name, which
> is in an anchor). Some (like product description) are between tags
> (in the case of product description, the tag is font).
>
> To capture product descriptions, I've been using the save_bgn() and
> save_end() methods. But I've noticed that the result of save_end() only
> includes text that isn't marked up.
[...]


I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag.

import HTMLParser, htmlentitydefs

class CatalogParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.infont = False
        self.text = []

    def handle_starttag(self, tag, atts):
        if tag == "font":
            assert not self.infont
            self.infont = True

    def handle_entityref(self, name):
        # entities like &quot; are resolved and kept as character data
        if self.infont:
            self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "font":
            assert self.infont
            self.infont = False
            if self.text:
                print "".join(self.text)
            self.text = []    # reset for the next description

data = """
<html>
<body>
<h1>&quot;Ignore me&quot;</h1>
<font size="1">
This &wuerg; rectangle measures 7&quot; x 3&quot;.
</font>
</body>
</html>
"""

p = CatalogParser()
p.feed(data)
p.close()
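
A rough sketch of the tag-stack variant mentioned above, for illustration
only (the class name and the way unbalanced tags are handled are assumptions,
not tested code); it copes with nested font tags:

import HTMLParser, htmlentitydefs

class StackingParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.stack = []     # currently open tags
        self.text = []

    def handle_starttag(self, tag, atts):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # tolerate sloppy HTML: pop until the matching open tag is gone
        while self.stack and self.stack.pop() != tag:
            pass
        if tag == "font" and "font" not in self.stack and self.text:
            print "".join(self.text)
            self.text = []

    def handle_entityref(self, name):
        self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if "font" in self.stack:
            self.text.append(data)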

Peter

Jul 18 '05 #3
I've generally found that trying to parse the whole page with
regexps isn't appropriate. Here's a class that I use sometimes.
Basically you do something like

b = buf(urllib.urlopen(url).read())

and then search around for patterns you expect to find in the page:

b.search("name of the product")
b.rsearch('<a href="')
href = b.up_to('"')

Note that there's an esearch method that lets you do forward searches
for regexps (it defaults to case-insensitive matching, since that's usually
what you want for HTML). But unfortunately, due to a deficiency in the Python
library, there's no simple way to implement backwards regexp searches.

Maybe I'll clean up the interface for this thing sometime.

================================================================

import re

class buf:
    def __init__(self, text=''):
        self.buf = text
        self.point = 0
        self.stack = []

    def seek(self, offset, whence='set'):
        if whence == 'set':
            self.point = offset
        elif whence == 'cur':
            self.point += offset
        elif whence == 'end':
            self.point = len(self.buf) - offset
        else:
            raise ValueError, "whence must be one of ('set','cur','end')"

    def save(self):
        self.stack.append(self.point)

    def restore(self):
        self.point = self.stack.pop()

    def search(self, str):
        p = self.buf.index(str, self.point)
        self.point = p + len(str)
        return self.point

    def esearch(self, pat, *opts):
        opts = opts or [re.I]
        p = re.compile(pat, *opts)
        g = p.search(self.buf, self.point)
        self.point = g.end()
        return self.point

    def rsearch(self, str):
        p = self.buf.rindex(str, 0, self.point)
        self.point = p
        return self.point

    def up_to(self, str):
        a = self.point
        b = self.search(str)
        return self.buf[a:b - len(str)]   # exclude the delimiter itself
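
For example, against the kind of catalog markup in the original post (the
HTML snippet and tag layout here are invented for illustration):

html = ('<a href="/catalog/rect-7x3">Small Rectangle</a>'
        '<font size="1">This rectangle measures 7&quot; x 3&quot;.</font>')

b = buf(html)
b.search('<font size="1">')   # move point past the opening tag
desc = b.up_to('<')           # everything up to the closing tag
print desc                    # entities are untouched: ...7&quot; x 3&quot;.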
Jul 18 '05 #4
On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
> I've found the parser in the HTMLParser module to be a lot easier to use.
> Below is the rough equivalent of your posted code. In the general case you
> will want to keep a stack of tags instead of the simple infont flag.


Thanks! What are the main advantages of HTMLParser over htmllib?

The code gives me something to think about ... it doesn't work right now
because it turns out there are nested font tags (which means the asserts
fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
source file). I'll try playing with it and seeing if I can get it to do
what I want.

It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?

Thanks again!

Jen

Jul 18 '05 #5
jennyw wrote:
> On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
>> I've found the parser in the HTMLParser module to be a lot easier to use.
>> Below is the rough equivalent of your posted code. In the general case
>> you will want to keep a stack of tags instead of the simple infont flag.
>
> Thanks! What are the main advantages of HTMLParser over htmllib?


Basically htmllib.HTMLParser feeds a formatter that I don't need with
information that I would rather disregard.
HTMLParser.HTMLParser, on the other hand, has a simple interface (you've
pretty much seen it all in my tiny example).
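
The difference already shows up at construction time; a minimal sketch
(Python 2 module names):

import htmllib, formatter, HTMLParser

# htmllib.HTMLParser wants a formatter, even if you throw the output away
p1 = htmllib.HTMLParser(formatter.NullFormatter())

# HTMLParser.HTMLParser needs nothing beyond your own subclass
p2 = HTMLParser.HTMLParser()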
> The code gives me something to think about ... it doesn't work right now
> because it turns out there are nested font tags (which means the asserts
> fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
> source file). I'll try playing with it and seeing if I can get it to do
> what I want.
I would suspect that there are <font> tags without a corresponding </font>.
You could fix that by preprocessing the HTML source with a tool like tidy.
As an aside, font tags as search criteria are as bad as you can get. Try to
find something more specific, e.g. the "second column in every row of the
first table". If this gets too complex for HTMLParser, you can instead
convert the HTML into XML (again via tidy) and then read it into a DOM
tree.
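
A minimal sketch of that tidy-then-DOM route (the file name, the "-asxml"
invocation and the table layout are assumptions for illustration):

# tidy -asxml catalog.html > catalog.xml
from xml.dom import minidom

doc = minidom.parse("catalog.xml")
first_table = doc.getElementsByTagName("table")[0]
for row in first_table.getElementsByTagName("tr"):
    cells = row.getElementsByTagName("td")
    if len(cells) > 1:
        print cells[1].toxml()   # second column of every row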
> It would be easier if I could find a way to view the HTML as a tree ...
> as a side note, are there any good utils to do this?


I've never applied this primitive data extraction technique to large, complex
HTML files, so for me a text editor has been sufficient so far.
(If you are on Linux, you could give Quanta Plus a try.)
Peter

PS: You could ask the company supplying the catalog for a copy in a more
accessible format, assuming you are a customer rather than a competitor.
Jul 18 '05 #6
jennyw <je****@dangerousideas.com> writes:
> On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
> [...]
> Thanks! What are the main advantages of HTMLParser over htmllib?
It won't choke on XHTML.
> [...]
> It would be easier if I could find a way to view the HTML as a tree ...
> as a side note, are there any good utils to do this?


Not that I know of (google for it), but DOM is probably the easiest
way to make one. DOM libraries often have a prettyprint function to
(textually) print DOM nodes (e.g. 4DOM from PyXML), which I've found
quite useful -- but of course that's just a chunk of the HTML nicely
reformatted as XHTML. Alternatively, you could use something like
graphviz / dot and some DOM-traversing code to make graphical trees.
Unfortunately, if this is HTML 'as deployed' (i.e. unparseable junk),
you may have to run it through HTMLTidy before it goes into your DOM
parser (use mxTidy or uTidylib).
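
For a quick-and-dirty textual tree view, a few lines of minidom are enough
once the page has been tidied into well-formed XML (the file name is assumed):

from xml.dom import minidom

def outline(node, depth=0):
    # print element names as an indented tree, skipping text nodes
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            print "  " * depth + child.nodeName
            outline(child, depth + 1)

doc = minidom.parse("catalog.xml")
outline(doc)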
John
Jul 18 '05 #7
Mathias Waack fed this fish to the penguins on Wednesday 05 November
2003 00:21 am:

> And what's the problem? HTML code produced by broken software like
> Frontpage often contains unnecessary quotes - why do you want to
> preserve this crap?
Those look to be intentional "s -- a marker for "inches":

7" x 3" -> 7 inches by 3 inches

--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net       | Bestiaria Support Staff
Bestiaria Home Page: http://www.beastie.dm.net/
Home Page: http://www.dm.net/~wulfraed/


Jul 18 '05 #8

