(htmllib) How to capture text that includes tags?

I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(in the case of the product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end() only
includes text that isn't marked up. For example, this product
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

Drops the quotation marks, resulting in:

This rectangle measures 7 x 3.

I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?

This is the relevant portion of the class I'm using so far:

import htmllib, re

class myHTMLParser(htmllib.HTMLParser):

    def __init__(self, f):
        htmllib.HTMLParser.__init__(self, f)

    def start_font(self, attrs):
        self.save_bgn()

    def end_font(self):
        text = self.save_end()
        if text:
            if re.search(r"\.\s*$", text):
                print "Probably a product description: " + text

    # I needed to override save_end because it was having trouble
    # when data was nothing.

    def save_end(self):
        """Ends buffering character data and returns all data saved since
        the preceding call to the save_bgn() method.

        If the nofill flag is false, whitespace is collapsed to single
        spaces. A call to this method without a preceding call to the
        save_bgn() method will raise a TypeError exception.

        """
        data = self.savedata
        self.savedata = None
        if data:
            if not self.nofill:
                data = ' '.join(data.split())
        return data

Thanks!

Jen

Jul 18 '05 #1
jennyw wrote:
I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name,
which is in an anchor). Some (like the product description) is between
tags (in the case of the product description, the tag is font).

To capture product descriptions, I've been using the save_bgn() and
save_end() methods. But I've noticed that the result of save_end()
only
includes text that isn't marked up. For example, this product
description:

<font size="1">
This rectangle measures 7&quot; x 3&quot;.
</font>

Drops the quotation marks, resulting in:

This rectangle measures 7 x 3.


And what's the problem? HTML code produced by broken software like
FrontPage often contains unnecessary quotes - why do you want to
preserve this crap?

If you want to escape special characters you can use
xml.sax.saxutils.escape() or just write your own function (escape is
only a two-liner).
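
For instance, a minimal sketch of that approach (escape() only converts
&, < and > by default, so the double quote has to be passed in via the
optional entities dict):

from xml.sax.saxutils import escape

desc = 'This rectangle measures 7" x 3".'
# turn the literal quotes back into &quot; entities
print escape(desc, {'"': '&quot;'})
# -> This rectangle measures 7&quot; x 3&quot;.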

Mathias
Jul 18 '05 #2
jennyw wrote:
I'm trying to parse a product catalog written in HTML. Some of the
information I need is in tag attributes (like the product name, which
is in an anchor). Some (like the product description) is between tags
(in the case of the product description, the tag is font).
[...]
I've been looking through Google Groups but haven't found a way to get
the markup in between the tags. Any suggestions?


I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag (a
sketch of that variant follows the example).

import HTMLParser, htmlentitydefs

class CatalogParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.infont = False
        self.text = []

    def handle_starttag(self, tag, atts):
        if tag == "font":
            assert not self.infont
            self.infont = True

    def handle_entityref(self, name):
        if self.infont:
            self.handle_data(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "font":
            assert self.infont
            self.infont = False
            if self.text:
                print "".join(self.text)

data = """
<html>
<body>
<h1>&quot;Ignore me&quot;</h1>
<font size="1">
This &wuerg; rectangle measures 7&quot; x 3&quot;.
</font>
</body>
</html>
"""
p = CatalogParser()
p.feed(data)
p.close()
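
Something like the following rough sketch shows the stack-based variant
(the class name is just for illustration): it keeps a list of the
currently open tags, so nested font elements are handled without asserts,
and the collected text is reset after the outermost </font>:

import HTMLParser, htmlentitydefs

class StackingCatalogParser(HTMLParser.HTMLParser):
    entitydefs = htmlentitydefs.entitydefs

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.tagstack = []          # currently open tags
        self.text = []

    def infont(self):
        return "font" in self.tagstack

    def handle_starttag(self, tag, atts):
        self.tagstack.append(tag)

    def handle_entityref(self, name):
        if self.infont():
            self.text.append(self.entitydefs.get(name, "?"))

    def handle_data(self, data):
        if self.infont():
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in self.tagstack:
            # pop everything up to and including the matching start tag
            while self.tagstack.pop() != tag:
                pass
        if tag == "font" and not self.infont():
            # we just left the outermost font element
            text = " ".join("".join(self.text).split())
            self.text = []
            if text:
                print text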

Peter

Jul 18 '05 #3
I've generally found that trying to parse the whole page with
regexps isn't appropriate. Here's a class that I use sometimes.
Basically you do something like

b = buf(urllib.urlopen(url).read())

and then search around for patterns you expect to find in the page:

b.search("name of the product")
b.rsearch('<a href="')
href = b.up_to('"')

Note that there's an esearch method that lets you do forward searches
for regexps (defaults to case independent since that's usually what
you want for html). But unfortunately, due to a deficiency in the Python
library, there's no simple way to implement backwards regexp searches.

Maybe I'll clean up the interface for this thing sometime.

================================================================

import re

class buf:
    def __init__(self, text=''):
        self.buf = text
        self.point = 0
        self.stack = []

    def seek(self, offset, whence='set'):
        if whence == 'set':
            self.point = offset
        elif whence == 'cur':
            self.point += offset
        elif whence == 'end':
            self.point = len(self.buf) - offset
        else:
            raise ValueError, "whence must be one of ('set','cur','end')"

    def save(self):
        self.stack.append(self.point)

    def restore(self):
        self.point = self.stack.pop()

    def search(self, str):
        p = self.buf.index(str, self.point)
        self.point = p + len(str)
        return self.point

    def esearch(self, pat, *opts):
        opts = opts or [re.I]
        p = re.compile(pat, *opts)
        g = p.search(self.buf, self.point)
        self.point = g.end()
        return self.point

    def rsearch(self, str):
        p = self.buf.rindex(str, 0, self.point)
        self.point = p
        return self.point

    def up_to(self, str):
        a = self.point
        b = self.search(str)
        return self.buf[a:b-1]
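
For what it's worth, a small usage sketch against a made-up catalog
fragment (note the extra search('"') step to move past the opening quote
before calling up_to()):

html = '''
<td><a href="/catalog/item42.html">Rectangle Widget</a></td>
<td><font size="1">This rectangle measures 7&quot; x 3&quot;.</font></td>
'''

b = buf(html)

# find the product link and pull out its href attribute
b.search('Rectangle Widget')     # point is now just past the product name
b.rsearch('<a href="')           # back up to the start of the anchor tag
b.search('"')                    # step past the opening quote of the href
href = b.up_to('"')              # everything up to the closing quote
print href                       # -> /catalog/item42.html

# grab the raw description text, entities and all
b.search('<font size="1">')
desc = b.up_to('<').strip()
print desc                       # -> This rectangle measures 7&quot; x 3&quot;.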
Jul 18 '05 #4
On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case you
will want to keep a stack of tags instead of the simple infont flag.


Thanks! What are the main advantages of HTMLParser over htmllib?

The code gives me something to think about ... it doesn't work right now
because it turns out there are nested font tags (which means the asserts
fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
source file). I'll try playing with it and seeing if I can get it to do
what I want.

It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?

Thanks again!

Jen

Jul 18 '05 #5
jennyw wrote:
On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
I've found the parser in the HTMLParser module to be a lot easier to use.
Below is the rough equivalent of your posted code. In the general case
you will want to keep a stack of tags instead of the simple infont flag.
Thanks! What are the main advantages of HTMLParser over htmllib?


Basically htmllib.HTMLParser feeds a formatter that I don't need with
information that I would rather disregard.
HTMLParser.HTMLParser, on the other hand, has a simple interface (you've
pretty much seen it all in my tiny example).
The code gives me something to think about ... it doesn't work right now
because it turns out there are nested font tags (which means the asserts
fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
source file). I'll try playing with it and seeing if I can get it to do
what I want.
I would suspect that there are <font> tags without a corresponding </font>.
You could fix that by preprocessing the html source with a tool like tidy.
As an aside, font tags as search criteria are about as bad as you can get.
Try to find something more specific, e.g. the "second column in every row
of the first table". If this gets too complex for HTMLParser, you can
instead convert the HTML into XML (again via tidy) and then read it into a
DOM tree.
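
As a rough illustration of that last route, assuming the catalog has
already been run through tidy and saved as well-formed XHTML in a file
called catalog.xml (the file name is made up), the standard minidom module
can walk it as a tree:

from xml.dom import minidom

dom = minidom.parse("catalog.xml")    # tidied, well-formed XHTML

def text_of(node):
    # concatenate the text nodes below an element
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)

for font in dom.getElementsByTagName("font"):
    if font.getAttribute("size") == "1":
        print " ".join(text_of(font).split())

Note that a real XML parser expands &quot; to a plain " character, so the
inch marks survive.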
It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?


I've never applied this primitive data extraction technique to large complex
html files, so for me a text editor has been sufficient so far.
(If you are on Linux, you could give Quanta Plus a try)
Peter

PS: You could ask the company supplying the catalog for a copy in a more
accessible format, assuming you are a customer rather than a competitor.
Jul 18 '05 #6
jennyw <je****@dangerousideas.com> writes:
On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
[...] Thanks! What are the main advantages of HTMLParser over htmllib?
It won't choke on XHTML.
[...] It would be easier if I could find a way to view the HTML as a tree ...
as a side note, are there any good utils to do this?


Not that I know of (google for it), but DOM is probably the easiest
way to make one. DOM libraries often have a prettyprint function to
(textually) print DOM nodes (e.g. 4DOM from PyXML), which I've found
quite useful -- but of course that's just a chunk of the HTML nicely
reformatted as XHTML. Alternatively, you could use something like
graphviz / dot and some DOM-traversing code to make graphical trees.
Unfortunately, if this is HTML 'as deployed' (i.e. unparseable junk),
you may have to run it through HTMLTidy before it goes into your DOM
parser (use mxTidy or uTidylib).
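
For a quick textual tree, once the markup is well-formed XHTML (here a
tiny hand-written snippet), minidom's toprettyxml() is often enough:

from xml.dom import minidom

xhtml = """<html><body>
<font size="1">This rectangle measures 7&quot; x 3&quot;.</font>
</body></html>"""

dom = minidom.parseString(xhtml)
print dom.toprettyxml(indent="  ")   # one node per line, indented by depth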
John
Jul 18 '05 #7
Mathias Waack fed this fish to the penguins on Wednesday 05 November
2003 00:21 am:

And what's the problem? HTML code produced by broken software like
FrontPage often contains unnecessary quotes - why do you want to
preserve this crap?
Those look to be intentional " characters -- the marker for "inches":

7" x 3" -> 7 inches by 3 inches

--
================================================================ <
wl*****@ix.netcom.com | Wulfraed  Dennis Lee Bieber  KD6MOG     <
wu******@dm.net       | Bestiaria Support Staff                 <
================================================================ <
        Bestiaria Home Page: http://www.beastie.dm.net/         <
            Home Page: http://www.dm.net/~wulfraed/             <

Jul 18 '05 #8
