PyParsing module or HTMLParser

Lad

I came across the pyparsing module by Paul McGuire. It seems nice, but
I am not sure whether it is the best fit for my needs.
I need to extract some text from an HTML page. The text is in tables, and a
table can be nested inside another table.
Is it better and easier to use the pyparsing module or HTMLParser?

Thanks for suggestions.
La.

Jul 18 '05 #1

Bill Mill

On 28 Mar 2005 12:01:34 -0800, Lad <py****@hope.cz> wrote:

> I came across the pyparsing module by Paul McGuire. It seems nice, but
> I am not sure whether it is the best fit for my needs.
> I need to extract some text from an HTML page. The text is in tables, and a
> table can be nested inside another table.
> Is it better and easier to use the pyparsing module or HTMLParser?

You might want to check out BeautifulSoup at:
http://www.crummy.com/software/BeautifulSoup/ .

Peace
Bill Mill
bill.mill at gmail.com

Jul 18 '05 #2

EuGeNe

Lad wrote:

> I came across the pyparsing module by Paul McGuire. It seems nice, but
> I am not sure whether it is the best fit for my needs.
> I need to extract some text from an HTML page. The text is in tables, and a
> table can be nested inside another table.
> Is it better and easier to use the pyparsing module or HTMLParser?
>
> Thanks for suggestions.
> La.

Check out BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/);
it did the job for me!

--
EuGeNe

[----
www.boardkulture.com
www.actiphot.com
www.xsbar.com
----]

Jul 18 '05 #3

Paul McGuire

La -

In general, I have shied away from doing general-purpose HTML parsing
with pyparsing. It's a crowded field, and it's likely that there are
better candidates out there for your problem. I've heard good things
about BeautifulSoup, but I've also heard from at least one person that
they prefer pyparsing to BS.

I personally have had good luck with *simple* HTML scraping with
pyparsing, such as extracting data from tables. It just depends on how
variable your source text is. Tables within tables may be a bit
challenging, but we'll never know unless you provide more to go on. If
you post a URL or some sample HTML, I could give you a more definitive
answer (possibly even a working code sample, you never know).

-- Paul

Jul 18 '05 #4
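
For comparison with the HTMLParser option raised in the original question, here is a minimal sketch of how nested tables could be handled with the standard library's html.parser (the Python 3 name of the HTMLParser module). The class name TableTextExtractor and the sample markup are invented for illustration, and the sketch assumes reasonably well-formed HTML in which every <td> and <table> is closed.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell with its table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of <table> elements currently open
        self.cell_stack = []    # one (depth, text parts) entry per open cell
        self.cells = []         # finished cells as (depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag in ("td", "th"):
            self.cell_stack.append((self.depth, []))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif tag in ("td", "th") and self.cell_stack:
            depth, parts = self.cell_stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.cells.append((depth, text))

    def handle_data(self, data):
        # Text is credited to the innermost open cell only
        if self.cell_stack:
            self.cell_stack[-1][1].append(data)


# Hypothetical sample: one table nested inside a cell of another
sample = """
<table><tr>
  <td>Outer cell
    <table><tr><td>Inner <b>cell</b></td></tr></table>
  </td>
  <td>Second outer cell</td>
</tr></table>
"""

parser = TableTextExtractor()
parser.feed(sample)
parser.close()
for depth, text in parser.cells:
    print(depth, text)
```

Cells are recorded when they close, so an inner cell appears before the outer cell that contains it. Real-world pages with unclosed tags would need extra handling, which is one reason the BeautifulSoup suggestions above are worth considering.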

Lad

Paul,
Thank you for your reply.

Here is a test page that I would like to try with pyparsing:

http://www.ourglobalmarket.com/Test.htm

From that page I would like to extract:

title (it is below "Lanjin Electronics Co., Ltd.":
"Sell 2.4GHz Wireless Mini Color Camera With Audio Function")
description (below the title, next to the picture)
Contact person
Company name
Address
fax
phone
Website Address

Do you think that pyparsing will work for that?

Best regards,
Lad.

Jul 18 '05 #5

Paul McGuire

Lad -

Well, here's what I've got so far. I'll leave the extraction of the
description to you as an exercise, but as a clue, it looks like it is
delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at
the beginning, and "Quantity: 500<br>" at the end, where 500 could be
any number. This program will print out:

['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function
Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
['Contact:', 'Mr. Simon Cheung']
['Company:', 'Lanjin Electronics Co., Ltd.']
['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung
Wan , Hong Kong\n , HK\n ( Hong Kong
)']
['Phone:', '852 35763877']
['Fax:', '852 31056238']
['Mobile:', '852-96439737']

So I think pyparsing will get you pretty far along the way. Code
attached below.

-- Paul

===================================
from pyparsing import (CaselessLiteral, Literal, SkipTo, Word, ZeroOrMore,
                       alphanums, dblQuotedString)
import urllib.request

# get input data
url = "http://www.ourglobalmarket.com/Test.htm"
page = urllib.request.urlopen( url )
# read() returns bytes on Python 3; the page encoding is not stated
# anywhere, so decoding as UTF-8 is an assumption
pageHTML = page.read().decode( "utf-8", "replace" )
page.close()

#~ I would like to extract the title ( it is below Lanjin Electronics
#~ Co., Ltd. )
#~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )

#~ description - below the title next to the picture
#~ Contact person
#~ Company name
#~ Address
#~ fax
#~ phone
#~ Website Address

LANGBRK = Literal("<")
RANGBRK = Literal(">")
SLASH = Literal("/")
# matches only attributes written as name="value"
tagAttr = Word(alphanums) + "=" + dblQuotedString

# helpers for defining HTML tag expressions
def startTag( tagname ):
    return ( LANGBRK + CaselessLiteral(tagname) +
             ZeroOrMore(tagAttr) + RANGBRK ).suppress()
def endTag( tagname ):
    return ( LANGBRK + SLASH + CaselessLiteral(tagname) + RANGBRK ).suppress()
def makeHTMLtags( tagname ):
    return startTag(tagname), endTag(tagname)
def strong( expr ):
    # strongStartTag/strongEndTag are defined below, before the first call
    return strongStartTag + expr + strongEndTag

strongStartTag, strongEndTag = makeHTMLtags("strong")
titleStart, titleEnd = makeHTMLtags("title")
tdStart, tdEnd = makeHTMLtags("td")
h1Start, h1End = makeHTMLtags("h1")

# each labelled field is a <td><strong>Label:</strong></td><td>value</td> pair
title = titleStart + SkipTo( titleEnd ).setResultsName("title") + titleEnd
contactPerson = tdStart + h1Start + \
                SkipTo( h1End ).setResultsName("contact")
company = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("company")
address = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("address")
phoneNum = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("phoneNum")
faxNum = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("faxNum")
mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("mobileNum")
webSite = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
                SkipTo( tdEnd ).setResultsName("webSite")
scrapes = ( title | contactPerson | company | address | phoneNum | faxNum
            | mobileNum | webSite )

# use parse actions to remove hyperlinks
linkStart, linkEnd = makeHTMLtags("a")
linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd
def stripHyperLink(s,l,t):
    # t[0] is the label, t[1] the cell text that may contain <a>...</a>
    return [ t[0], linkExpr.transformString( t[1] ) ]
company.setParseAction( stripHyperLink )

# use parse actions to add labels for data elements that don't
# have labels in the HTML
def prependLabel(pre):
    def prependAction(s,l,t):
        return [pre] + t[:]
    return prependAction
title.setParseAction( prependLabel("Title:") )
contactPerson.setParseAction( prependLabel("Contact:") )

for tokens,start,end in scrapes.scanString( pageHTML ):
    print( tokens )

Jul 18 '05 #6
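
The printed Address value above still carries newlines and runs of spaces from the page markup. A minimal sketch of one way to tidy the label/value pairs into a dictionary follows. The helper name tokens_to_record is hypothetical, the sample rows are copied from the output shown in the post, and the exact whitespace inside the address string is an approximation because the forum collapsed it.

```python
def tokens_to_record(token_lists):
    """Turn ['Label:', 'value'] pairs into {'Label': 'value'} with tidy whitespace."""
    record = {}
    for label, value in token_lists:
        key = label.rstrip(":").strip()
        # split()/join collapses newlines and repeated spaces to single spaces
        record[key] = " ".join(value.split())
    return record


# Rows taken from the output quoted in the post (address whitespace approximated)
rows = [
    ["Contact:", "Mr. Simon Cheung"],
    ["Address:", "Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung\n"
                 "Wan , Hong Kong\n , HK\n ( Hong Kong\n)"],
    ["Phone:", "852 35763877"],
]

record = tokens_to_record(rows)
print(record["Address"])
```

This assumes every match yields exactly two tokens, a label and a value, as in the output listed in the post. Under that assumption the token lists produced by the scanString loop could be passed in directly instead of the hand-written rows.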

Lad

Paul, thanks a lot.
It seems to work, but I will have to study the sample hard to be able to
do the exercise (the extraction of the description) successfully. Is it
possible to email you if I need some help with that exercise?
Thanks again for your help.
Lad.

Jul 18 '05 #7
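
As a starting point for the description exercise, here is a rough sketch built on the two delimiters Paul quoted. It uses the standard library's re module rather than pyparsing, purely to stay dependency-free; the same idea could be expressed in pyparsing with SkipTo. The function name extract_description and the sample markup are invented for illustration, and the exact-string start delimiter assumes the page's whitespace matches the quoted text.

```python
import re

# Delimiters as quoted in the thread; the end marker allows any quantity
START = "<b>View Detail</b></a></td></tr></tbody></table> <br>"
END = re.compile(r"Quantity:\s*\d+\s*<br>")


def extract_description(html):
    """Return the text between START and the Quantity line, or None."""
    start_at = html.find(START)
    if start_at == -1:
        return None
    body_start = start_at + len(START)
    end_match = END.search(html, body_start)
    if end_match is None:
        return None
    raw = html[body_start:end_match.start()]
    without_tags = re.sub(r"<[^>]+>", " ", raw)   # drop any inline tags
    return " ".join(without_tags.split())         # collapse whitespace


# Hypothetical markup, not copied from the real test page
sample = (
    '<a href="#"><b>View Detail</b></a></td></tr></tbody></table> <br>'
    "Small wireless camera<br>with audio.<br>"
    "Quantity: 500<br>"
)

description = extract_description(sample)
print(description)
```

If the real page differs in spacing or tag case around "View Detail", the start delimiter would need loosening, for example by turning it into a regular expression as well.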

Paul McGuire

Yes, drop me a note if you get stuck.

-- Paul
base64.decodestring('cHRtY2dAYXVzdGluLnJyLmNvbQ==' )

Jul 18 '05 #8
