Help Parsing an HTML File

egonslokar

Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single unicode file that has descriptions of hundreds of
objects. The file fairly resembles HTML-EXAMPLE pasted below.

I need to parse the file in such a way to extract data out of the html
and to come up with a tab separated file that would look like OUTPUT-
FILE below.

Any tips, advice and guidance is greatly appreciated.

Thanks,

Egon

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
RoséSegmentDIV3-1
PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4
YellowSegmentDIV2-4 No-Value
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>
------HTML Example End------

Feb 15 '08 #1

Subscribe Post Reply

1520

Tim Chase

I need to parse the file in such a way to extract data out of the html

and to come up with a tab separated file that would look like OUTPUT-
FILE below.

BeautifulSoup[1]. Your one-stop-shop for all your HTML parsing
needs.

What you do with the parsed data, is an exercise left to the
reader, but it's navigable.

-tkc

[1] http://www.crummy.com/software/BeautifulSoup/

Feb 15 '08 #2

Mike Driscoll

On Feb 15, 3:28 pm, egonslo...@gmail.com wrote:

Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single unicode file that has descriptions of hundreds of
objects. The file fairly resembles HTML-EXAMPLE pasted below.

I need to parse the file in such a way to extract data out of the html
and to come up with a tab separated file that would look like OUTPUT-
FILE below.

Any tips, advice and guidance is greatly appreciated.

Thanks,

Egon

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
RoséSegmentDIV3-1
PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4
YellowSegmentDIV2-4 No-Value
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>
------HTML Example End------

Pyparsing, ElementTree and lxml are all good candidates as well.
BeautifulSoup takes care of malformed html though.

http://pyparsing.wikispaces.com/
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

Mike

Feb 15 '08 #3

Stefan Behnel

eg********@gmail.com wrote:

I have a single unicode file that has descriptions of hundreds of
objects. The file fairly resembles HTML-EXAMPLE pasted below.

I need to parse the file in such a way to extract data out of the html
and to come up with a tab separated file that would look like OUTPUT-
FILE below.

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

</html>
------HTML Example End------

Now, what ugly markup is that? You will never manage to get any HTML compliant
parser return the "segmentX" stuff in there. I think your best bet is really
going for pyparsing or regular expressions (and I actually recommend pyparsing
here).

Stefan

Feb 16 '08 #4

Peter Otten

Stefan Behnel wrote:

eg********@gmail.com wrote:
>I have a single unicode file that has descriptions of hundreds of
objects. The file fairly resembles HTML-EXAMPLE pasted below.

I need to parse the file in such a way to extract data out of the html
and to come up with a tab separated file that would look like OUTPUT-
FILE below.

=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RosÃ©H1-1 RosÃ©H2-1 RosÃ©DIV-1 RosÃ©SegmentDIV1-1 RosÃ©SegmentDIV2-1
------Tab Separated Output File End------

=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RosÃ©H1-1</h1>
<h2>RosÃ©H2-1</h2>
<div>RosÃ©DIV-1</div>
<div "segment1">RosÃ©SegmentDIV1-1</div><br>
<div "segment2">RosÃ©SegmentDIV2-1</div><br>
<div "segment3">RosÃ©SegmentDIV3-1</div><br>
<br>
<br>

</html>
------HTML Example End------

Now, what ugly markup is that? You will never manage to get any HTML
compliant parser return the "segmentX" stuff in there. I think your best
bet is really going for pyparsing or regular expressions (and I actually
recommend pyparsing here).

Stefan

In practice the following might be sufficient:

from BeautifulSoup import BeautifulSoup

def chunks(bs):
chunk = []
for tag in bs.findAll(["h1", "h2", "div"]):
if tag.name == "h1":
if chunk:
yield chunk
chunk = []
chunk.append(tag)
if chunk:
yield chunk

def process(filename):
bs = BeautifulSoup(open(filename))
for chunk in chunks(bs):
columns = [tag.string for tag in chunk]
columns += ["No Value"] * (6 - len(columns))
print "\t".join(columns)

if __name__ == "__main__":
process("example.html")

The biggest caveat is that only columns at the end of a row may be left out.

Peter

Feb 16 '08 #5

Similar topics

XML file parsing with SAX

by: Willem Ligtenberg | last post by:

I decided to use SAX to parse my xml file. But the parser crashes on: File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception...

Python

Help with a Simple Question

by: Terry | last post by:

Hi, This is a newbie's question. I want to preload 4 images and only when all 4 images has been loaded into browser's cache, I want to start a slideshow() function. If images are not completed...

Javascript

Cannot use mail() in IE, only works in a debugger--help

by: baustin75 | last post by:

Posted: Mon Oct 03, 2005 1:41 pm Post subject: cannot mail() in ie only when debugging in php designer 2005 -------------------------------------------------------------------------------- ...

PHP

Parsing complex xml file with C#

by: Pir8 | last post by:

I have a complex xml file, which contains stories within a magazine. The structure of the xml file is as follows: <?xml version="1.0" encoding="ISO-8859-1" ?> <magazine> <story>...

C# / C Sharp

Parsing an HTML table with XML

by: Rick Walsh | last post by:

I have an HTML table in the following format: <table> <tr><td>Header 1</td><td>Header 2</td></tr> <tr><td>1</td><td>2</td></tr> <tr><td>3</td><td>4</td></tr> <tr><td>5</td><td>6</td></tr>...

.NET Framework

Parsing an html/aspx file

by: Neil.Smith | last post by:

I can't seem to find any references to this, but here goes: In there anyway to parse an html/aspx file within an asp.net application to gather a collection of controls in the file. For instance...

ASP.NET

VB6 OR VBA & Webbrowser DOM Tiny $50 Mini Project Programmer help

by: gunimpi | last post by:

http://www.vbforums.com/showthread.php?p=2745431#post2745431 ******************************************************** VB6 OR VBA & Webbrowser DOM Tiny $50 Mini Project Programmer help wanted...

Microsoft Access / VBA

HELP: Parsing XML fails under frames...

by: hzgt9b | last post by:

I've written a simple javascript page that parses an XML file... (Actually I just modified the "Parsing an XML File" sample from http://www.w3schools.com/dom/dom_parser.asp) The page works great...

Javascript

HTML File Parsing

by: Felipe De Bene | last post by:

I'm having problems parsing an HTML file with the following syntax : <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'> <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH> <TH...

Python

my code won't validate, help!

by: embz | last post by:

this post concerns three pages. 1. this page: http://www.katherine-designs.com/sendemail.php i get the following errors: a lot of it seems to deal with the PHP code i inserted to the page....

HTML / CSS

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice