Bytes IT Community

Parsing HTML, extracting text and changing attributes.

Hi,

I work at this company and we are rebuilding our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyway, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) into Word documents. I found the idea
pretty stupid, since the styling would have to be applied from scratch
anyway: we don't want to keep either the old HTML markup or Microsoft
Word's BS code.

I proposed to take each page and make a copy with only the text, with
class names on the textual elements (h1, h2, p, strong, em ...), and
then define a CSS file giving them some style.

Now, we have around 1,600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
parsing all those pages with Python, ripping out the navigation bars
and keeping just the text and layout tags, and then applying class
names to specific tags. The program would also have to remove the
table the text sits in. Another difficulty is that I want to keep
tables that are actually used for tabular data, not for positioning.

So, I'm writing to ask your opinion on which tools and techniques I
should use for this.
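For the layout-vs-data table question, I was thinking of a heuristic along these lines (just a sketch using only the standard library; the thresholds are guesses I'd have to tune against the real pages):

```python
# Sketch: guess whether a table holds data or is just for layout.
# Uses only the standard library; the thresholds are guesses to tune.
from html.parser import HTMLParser

class TableSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.has_th = False      # <th> cells strongly suggest tabular data
        self.row_lengths = []    # number of <td> cells per row
        self._cells = 0
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        elif tag == "tr" and self._depth:
            self._cells = 0
        elif tag == "th":
            self.has_th = True
        elif tag == "td":
            self._cells += 1

    def handle_endtag(self, tag):
        if tag == "tr" and self._depth:
            self.row_lengths.append(self._cells)
        elif tag == "table" and self._depth:
            self._depth -= 1

def looks_like_data_table(markup):
    s = TableSniffer()
    s.feed(markup)
    if s.has_th:
        return True
    # Several rows with a consistent cell count also suggests real data.
    return len(s.row_lengths) >= 3 and len(set(s.row_lengths)) == 1

print(looks_like_data_table(
    "<table><tr><th>Year</th></tr><tr><td>2007</td></tr></table>"))  # True
print(looks_like_data_table(
    "<table><tr><td>nav</td><td>content</td></tr></table>"))         # False
```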

Jun 18 '07 #1
9 Replies



seb...@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
Take a look at the parsing example on this page:
http://wiki.python.org/moin/SimplePrograms
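A stripped-down version of that idea with the standard library's HTMLParser (the tag whitelist is just an example):

```python
# Minimal text grabber using only the standard library's HTMLParser.
# The tag whitelist below is just an example.
from html.parser import HTMLParser

KEEP = {"h1", "h2", "h3", "p", "strong", "em"}

class TextGrabber(HTMLParser):
    """Collect (tag, text) pairs for a whitelist of textual tags."""
    def __init__(self):
        super().__init__()
        self._stack = []   # open tags we care about
        self.chunks = []   # (tag, text) results

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self._stack.append((tag, []))

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            t, parts = self._stack.pop()
            text = "".join(parts).strip()
            if text:
                self.chunks.append((t, text))

g = TextGrabber()
g.feed("<h1>Title</h1><p>Some text here.</p>")
print(g.chunks)   # [('h1', 'Title'), ('p', 'Some text here.')]
```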

--
HTH,
Rob

Jun 18 '07 #2

se****@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev
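For example (a sketch; assumes the third-party lxml package is installed):

```python
# Sketch: lxml's HTML parser copes with broken, unclosed markup.
# Assumes the third-party lxml package is installed (pip install lxml).
from lxml import html

doc = html.fromstring("<h1>Title<p>Some <b>bold text")  # tag soup on purpose
print(doc.xpath("//h1/text()"))  # ['Title']
print(doc.text_content())        # all of the text, markup stripped
```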

Stefan
Jun 18 '07 #3

On 2007-06-18, se****@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I
should use to do this and what technique I should use.
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
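Driving it from Python could look roughly like this (a sketch; the lynx flags vary between versions, and the directory names are made up):

```python
# Sketch: drive "lynx -dump" over a folder of pages from Python.
# The directory names are made up; lynx flags vary between versions.
import shutil
import subprocess
from pathlib import Path

def dump_command(src):
    # -dump renders the page to stdout; -nolist drops the link footnotes.
    return ["lynx", "-dump", "-nolist", str(src)]

def dump_all(src_dir, out_dir):
    src = Path(src_dir)
    if not src.is_dir():          # nothing to do if the folder isn't there
        return
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in src.rglob("*.html"):
        result = subprocess.run(dump_command(page),
                                capture_output=True, text=True)
        (out / (page.stem + ".txt")).write_text(result.stdout)

if shutil.which("lynx"):           # only run lynx where it's installed
    dump_all("site", "dumped")     # hypothetical directories
print(dump_command("site/index.html"))
```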

--
Neil Cerutti
Jun 18 '07 #4


Neil Cerutti wrote:
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
The OP was looking for a way to parse out part of each file and apply classes to certain types of tags. Using lynx or links wouldn't help, since their output is plain text and the goal isn't to strip all the formatting.

Someone else mentioned lxml but as I understand it lxml will only work if it's valid XHTML that they're working with. Assuming it's not (since real-world HTML almost never is), perhaps BeautifulSoup will fare better.

http://www.crummy.com/software/Beaut...mentation.html
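The strip-and-relabel idea in BeautifulSoup might look like this (a sketch using the modern bs4 API; the id and class names are placeholders):

```python
# Sketch of the strip-and-relabel idea with BeautifulSoup (bs4).
# Assumes beautifulsoup4 is installed; id/class names are placeholders.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div id='nav'>menu</div><p>Real <em>content</em></p>",
                     "html.parser")
nav = soup.find("div", id="nav")
if nav:
    nav.decompose()                    # rip out the navigation block
for tag in soup.find_all(["p", "em"]):
    tag["class"] = "txt-" + tag.name   # placeholder class names
print(soup)
```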

-Jay
Jun 18 '07 #5

Jay Loden wrote:
Someone else mentioned lxml but as I understand it lxml will only work if
it's valid XHTML that they're working with.
No, it does what the OP requested. It even has a very good parser for
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

Stefan
Jun 18 '07 #6


Stefan Behnel wrote:
>Jay Loden wrote:
>>Someone else mentioned lxml but as I understand it lxml will only work
>>if it's valid XHTML that they're working with.
>
>No, it does what the OP requested. It even has a very good parser for
>broken HTML.
>
>http://codespeak.net/lxml/dev/parsing.html#parsing-html
I stand corrected, I missed that whole part of the LXML documentation :-)
Jun 18 '07 #7


I see there are a couple of tools I could use, and I also heard of
sgmllib and htmllib. So now there is lxml, BeautifulSoup, sgmllib,
htmllib ...

Do any of those tools do the job I need more easily than the others,
and which should I use? Maybe a combination of them; which one is
better for which part of the work?

Jun 18 '07 #9

se****@gmail.com wrote:
Is there any of those tools that does the job I need to do more easily
and what should I use? Maybe a combination of those tools, which one
is better for what part of the work?
Well, as I said, use lxml. It's fast, pythonically easy to use, and
extremely powerful and extensible. Apart from being its main author :),
I actually use it for lots of tiny things more or less like what you're
up to. It's just plain great for a quick script that gets you from A to
B for a bag of documents.

Parse the pages in with the HTML parser (even straight from URLs), then
use XPath to extract (exactly) what you want, and work on it as you
wish. That's short and simple in lxml.

http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath
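Put together, that loop might look something like this (a sketch; the XPath for the navigation cells and the class map are placeholders you'd adapt to the real pages):

```python
# Sketch of the parse -> XPath -> rewrite loop with lxml.
# Assumes lxml is installed; selectors and class names are placeholders.
from lxml import html

CLASS_MAP = {"h1": "title", "h2": "subtitle", "p": "body-text"}  # guesses

def clean_page(raw):
    doc = html.fromstring(raw)
    # Drop cells that are clearly navigation (placeholder selector).
    for nav in doc.xpath('//td[@class="nav"]'):
        nav.getparent().remove(nav)
    # Relabel the textual tags we keep.
    for tag, cls in CLASS_MAP.items():
        for el in doc.xpath("//" + tag):
            el.set("class", cls)
    return html.tostring(doc, encoding="unicode")

print(clean_page('<table><tr><td class="nav">menu</td>'
                 '<td><h1>Hi</h1><p>text</p></td></tr></table>'))
```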

Stefan
Jun 18 '07 #10

This discussion thread is closed.