
Parsing HTML, extracting text and changing attributes.

Hi,

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) into Word documents. I found the idea
pretty stupid, since the style will have to be applied from scratch anyway:
we want neither the old HTML code behind the pages nor Microsoft Word's
BS code.

I proposed to take each page and make a copy with only the text, and
with class names on the textual elements (h1, h2, p, strong, em ...),
and then define a CSS file giving them some style.

Now, we have around 1 600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
the possibility of parsing all those pages with Python, ripping out the
navigation bars and keeping just the text and layout tags, and then
applying class names to specific tags. The program would also have to
remove the table that the text is located in. Another difficulty is
that I want to be able to keep tables that are actually used for
tabular data and not positioning.

So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.

Jun 18 '07 #1

seb...@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
Take a look at the parsing example on this page:
http://wiki.python.org/moin/SimplePrograms
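
For a flavour of what the standard library alone can do, here is a rough
sketch (not the wiki's exact example) using the stdlib HTMLParser class,
with the Python 2 import path of the era; the whitelist of tags and the
"content-" class prefix are made up for illustration:

# Python 2 stdlib; in Python 3 the module is html.parser.
from HTMLParser import HTMLParser

KEEP = ('h1', 'h2', 'h3', 'p', 'strong', 'em', 'ul', 'ol', 'li')

class TextOnlyParser(HTMLParser):
    # Keep only a whitelist of textual tags, re-tagged with a CSS class;
    # all other markup is dropped (its text still passes through).
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.parts.append('<%s class="content-%s">' % (tag, tag))

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

parser = TextOnlyParser()
parser.feed(open('page.html').read())
print ''.join(parser.parts)

Bear in mind that the stdlib parser can choke on badly broken markup, which
is why the tools suggested further down the thread tend to cope better with
real-world pages.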

--
HTH,
Rob

Jun 18 '07 #2
se****@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev
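
For example, a minimal sketch along those lines; the td class and the XPath
expression are invented placeholders for whatever the real pages use:

from lxml import html

# lxml's HTML parser copes with real-world, non-XHTML markup.
doc = html.parse('http://caslt.org/').getroot()

# Placeholder XPath: print the text of every paragraph in the content cell.
for p in doc.xpath('//td[@class="content"]//p'):
    print p.text_content()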

Stefan
Jun 18 '07 #3
On 2007-06-18, se****@gmail.com <se****@gmail.com> wrote:
So, I'm writing this to have your opinion on what tools I
should use to do this and what technique I should use.
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
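
Something along these lines should do it, assuming lynx is installed and
the pages have been saved locally (the paths are made up for illustration):

import glob
import subprocess

# Render each saved page to plain text with lynx.
# -dump writes the formatted page to stdout; -nolist drops the link list.
for path in glob.glob('site/*.html'):
    text = subprocess.Popen(['lynx', '-dump', '-nolist', path],
                            stdout=subprocess.PIPE).communicate()[0]
    open(path + '.txt', 'w').write(text)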

--
Neil Cerutti
Jun 18 '07 #4

Neil Cerutti wrote:
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
OP was looking for a way to parse out part of the file and apply classes to certain types of tags. Using lynx/links wouldn't help, since the output of links or lynx is going to end up as plain text and the desire isn't to strip all the formatting.

Someone else mentioned lxml but as I understand it lxml will only work if it's valid XHTML that they're working with. Assuming it's not (since real-world HTML almost never is), perhaps BeautifulSoup will fare better.

http://www.crummy.com/software/Beaut...mentation.html
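
A rough sketch of that approach with the BeautifulSoup 3 API of the time;
the "nav" table id and the "content-" class names are invented placeholders:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3-era import

soup = BeautifulSoup(open('page.html').read())

# Rip out the navigation table (placeholder id).
nav = soup.find('table', id='nav')
if nav is not None:
    nav.extract()

# Tag the textual elements with a class for the new stylesheet.
for name in ('h1', 'h2', 'p', 'strong', 'em'):
    for tag in soup.findAll(name):
        tag['class'] = 'content-' + name

print soup.prettify()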

-Jay
Jun 18 '07 #5
Jay Loden wrote:
Someone else mentioned lxml but as I understand it lxml will only work if
it's valid XHTML that they're working with.
No, it was meant as the OP requested. It even has a very good parser for
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

Stefan
Jun 18 '07 #6

Stefan Behnel wrote:
>Jay Loden wrote:
>>Someone else mentioned lxml but as I understand it lxml will only work if
>>it's valid XHTML that they're working with.
>
>No, it was meant as the OP requested. It even has a very good parser for
>broken HTML.
>
>http://codespeak.net/lxml/dev/parsing.html#parsing-html
I stand corrected, I missed that whole part of the LXML documentation :-)
Jun 18 '07 #7

I see there are a couple of tools I could use, and I have also heard of
sgmllib and htmllib. So now there are lxml, Beautiful Soup, sgmllib,
htmllib ...

Is there one of those tools that does the job I need more easily,
and which should I use? Maybe a combination of those tools: which one
is better for which part of the work?

Jun 18 '07 #9
se****@gmail.com wrote:
I see there are a couple of tools I could use, and I have also heard of
sgmllib and htmllib. So now there are lxml, Beautiful Soup, sgmllib,
htmllib ...

Is there one of those tools that does the job I need more easily,
and which should I use? Maybe a combination of those tools: which one
is better for which part of the work?
Well, as I said, use lxml. It's fast, pythonically easy to use, extremely
powerful and extensible. Apart from being its main author :), I actually use
it for lots of tiny things more or less like what you're setting out to do.
It's just plain great for a quick script that gets you from A to B for a bag
of documents.

Parse the pages in with the HTML parser (even straight from URLs), then use
XPath to extract (exactly) what you want, and then work on it as you wish.
That's short and simple in lxml.

http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath
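
To make that concrete, a rough sketch of the whole round trip for a single
file; the "nav" id, the "content-" class names and the "table with no <th>
is a layout table" heuristic are invented placeholders, not anything taken
from the real site:

from lxml import html

doc = html.parse('old/page.html').getroot()

# Rip out the navigation bar (placeholder id).
for nav in doc.xpath('//table[@id="nav"]'):
    nav.drop_tree()

# Unwrap layout tables (placeholder heuristic: no <th> means layout),
# keeping tables that hold real tabular data. drop_tag() removes the
# tag itself but leaves its text and children in place.
for table in doc.xpath('//table[not(.//th)]'):
    for inner in table.xpath('.//tr | .//td'):
        inner.drop_tag()
    table.drop_tag()

# Give the textual elements a class for the new stylesheet.
for tag in doc.xpath('//h1 | //h2 | //p | //strong | //em'):
    tag.set('class', 'content-' + tag.tag)

open('new/page.html', 'w').write(html.tostring(doc))

Wrap that in a loop over the 1 600 files and most of the dull work is
scripted; deciding which tables are really layout tables is the only part
that will need a site-specific rule.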

Stefan
Jun 18 '07 #10

