473,396 Members | 1,997 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Parsing an HTML file

I am trying to find the best way to parse a bunch of html files. They
are all simillar in structure and I need to get them into a database.
Their relevant structure is:
<html><head></head>
<body>
<h1>title</h1>
<address> authors </address>
<div> Main html content</div>

I basically need to get the values between <h1></h1>,
<address></address> and <div></div>

I am able to read the the files into an array.
Jul 18 '05 #1
3 1445
CodeGuru73:
I am trying to find the best way to parse a bunch of html files.


http://www.python.org/doc/current/li...TMLParser.html

--
René Pijlman
Jul 18 '05 #2
"CodeGuru73" <ed***********@yahoo.com> wrote in message
news:5e**************************@posting.google.c om...
I am trying to find the best way to parse a bunch of html files. They
are all simillar in structure and I need to get them into a database.
Their relevant structure is:
<html><head></head>
<body>
<h1>title</h1>
<address> authors </address>
<div> Main html content</div>

I basically need to get the values between <h1></h1>,
<address></address> and <div></div>

I am able to read the the files into an array.


Check out this simple XML parsing code:
http://aspn.activestate.com/ASPN/Coo.../Recipe/157358

-- Paul
Jul 18 '05 #3
CodeGuru73 wrote:
I am trying to find the best way to parse a bunch of html files. They
are all simillar in structure and I need to get them into a database.


You can read chapter "HTML Processing" in Diveintopython:
http://diveintopython.org/html_processing/index.html
Jul 18 '05 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Willem Ligtenberg | last post by:
I decided to use SAX to parse my xml file. But the parser crashes on: File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception...
6
by: Hans Kamp | last post by:
Is it possible to write a function like the following: string ReadURL(string URL) { .... } The purpose is that it reads the URL (determined by the parameter) and returns the string in which...
3
by: Pir8 | last post by:
I have a complex xml file, which contains stories within a magazine. The structure of the xml file is as follows: <?xml version="1.0" encoding="ISO-8859-1" ?> <magazine> <story>...
11
by: Řrjan Langbakk | last post by:
I'm parsing a CSV-file into a table on a webpage - and I'd like to be able to change the alignment for the _last_ <td> in each <tr> - but, as the file is today, it's not possible for me to assign a...
4
by: Rick Walsh | last post by:
I have an HTML table in the following format: <table> <tr><td>Header 1</td><td>Header 2</td></tr> <tr><td>1</td><td>2</td></tr> <tr><td>3</td><td>4</td></tr> <tr><td>5</td><td>6</td></tr>...
0
by: bruce | last post by:
hi... it appears that i'm running into a possible problem with mechanize/browser/python rgarding the "select_form" method. i've tried the following and get the error listed: br.select_form(nr...
0
by: june | last post by:
Hi, I have a big problem with parsing HTML into a XHTML using Cberneko to validate the html. First I tried to work with a HTML-File. This solutions works fine: String aHTMLFile =...
4
by: Neil.Smith | last post by:
I can't seem to find any references to this, but here goes: In there anyway to parse an html/aspx file within an asp.net application to gather a collection of controls in the file. For instance...
1
by: Robert Neville | last post by:
Basically, I want to create a table in html, xml, or xslt; with any number of regular expressions; a script (Perl or Python) which reads each table row (regex and replacement); and performs the...
2
by: Felipe De Bene | last post by:
I'm having problems parsing an HTML file with the following syntax : <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'> <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH> <TH...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.