Hi,
I work at this company and we are rebuilding our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyway, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) into Word documents. I found that idea
pretty pointless, since the styling would have to be applied from scratch
anyway: we don't want to keep either the old HTML markup or Microsoft
Word's bloated code.
I proposed instead to take each page and make a copy containing only the
text, with class names on the textual elements (h1, h2, p, strong, em, ...),
and then define a CSS file giving them some style.
Now, we have around 1,600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
parsing all those pages with Python, ripping out the navigation bars and
keeping just the text and layout tags, and then applying class names to
specific tags. The program would also have to remove the table that the
text sits in. Another difficulty is that I want to keep the tables that
are actually used for tabular data rather than for positioning.
So, I'm writing to ask your opinion on which tools and techniques I
should use for this.
seb...@gmail.com wrote:
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
Take a look at the parsing examples on this page: http://wiki.python.org/moin/SimplePrograms
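If the stdlib route appeals, here is a rough sketch along those lines (the tag whitelist and the sample markup are my own invention, not taken from that page) using html.parser to keep only the textual tags and drop everything else, layout tables included:

```python
from html.parser import HTMLParser

# Tags worth keeping in the stripped-down copy; everything else
# (layout tables, navigation markup, ...) is silently dropped.
KEEP = {"h1", "h2", "h3", "p", "strong", "em", "ul", "ol", "li"}

class TextOnlyParser(HTMLParser):
    """Collect a simplified tag/text outline of an HTML page."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # >0 while inside at least one kept tag

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.depth += 1
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.depth -= 1
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.out.append(data.strip())

parser = TextOnlyParser()
parser.feed("<table><tr><td><h1>Title</h1>"
            "<p>Some <em>text</em>.</p></td></tr></table>")
print(" ".join(parser.out))
```

The wrapping table disappears because it never increments the depth counter; only text inside whitelisted tags survives. Distinguishing layout tables from real data tables would need an extra heuristic on top of this.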
--
HTH,
Rob

se****@gmail.com wrote:
I work at this company and we are re-building our website: http://caslt.org/. [...] So, I'm writing this to have your opinion on what tools I should use to do this and what technique I should use.
So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
lxml is what you're looking for, especially if you're familiar with XPath. http://codespeak.net/lxml/dev
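For instance, a minimal sketch of that approach (the markup here is a made-up example, not one of the OP's pages): feed lxml's forgiving HTML parser a broken, non-XHTML fragment and pull out exactly the elements you want with XPath.

```python
from lxml import html

# lxml.html copes with tag-soup markup; no valid XHTML required.
doc = html.fromstring(
    "<html><body><h1>Welcome<p>First paragraph</body></html>"
)

# XPath selects exactly the pieces you care about.
headings = doc.xpath("//h1/text()")
paragraphs = doc.xpath("//p/text()")
print(headings, paragraphs)
```

Note that neither the h1 nor the p is closed in the input; the parser repairs the tree before XPath ever sees it.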
Stefan
On 2007-06-18, se****@gmail.com <se****@gmail.com> wrote:
I work at this company and we are re-building our website: http://caslt.org/. [...]
So, I'm writing this to have your opinion on what tools I
should use to do this and what technique I should use.
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
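A sketch of that automation (the file layout is invented; -dump and -nolist are real lynx options, dumping rendered text and suppressing the trailing link list):

```python
import shutil
import subprocess
from pathlib import Path

def lynx_cmd(html_file: Path) -> list[str]:
    """Build the lynx command line for one document."""
    return ["lynx", "-dump", "-nolist", str(html_file)]

def dump_text(html_file: Path) -> str:
    """Render an HTML file to plain text via lynx."""
    result = subprocess.run(
        lynx_cmd(html_file), capture_output=True, text=True, check=True
    )
    return result.stdout

# Only attempt the conversion when lynx is actually on PATH.
if shutil.which("lynx"):
    for page in Path(".").glob("*.html"):
        Path(page.stem + ".txt").write_text(dump_text(page))
```

The obvious limitation, as noted below, is that the output is plain text: the tag structure the OP wants to keep and re-class is gone.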
--
Neil Cerutti
Neil Cerutti wrote:
You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.
The OP was looking for a way to parse out part of each file and apply classes to certain tags. Using lynx/links wouldn't help here, since their output is plain text, and the goal isn't to strip all the formatting.
Someone else mentioned lxml but as I understand it lxml will only work if it's valid XHTML that they're working with. Assuming it's not (since real-world HTML almost never is), perhaps BeautifulSoup will fare better. http://www.crummy.com/software/Beaut...mentation.html
-Jay
Jay Loden wrote:
Someone else mentioned lxml but as I understand it lxml will only work if
it's valid XHTML that they're working with.
No, lxml was suggested for exactly what the OP requested. It even has a
very good parser for broken HTML. http://codespeak.net/lxml/dev/parsing.html#parsing-html
Stefan
Stefan Behnel wrote:
Jay Loden wrote:
>Someone else mentioned lxml but as I understand it lxml will only work if it's valid XHTML that they're working with.
No, lxml was suggested for exactly what the OP requested. It even has a
very good parser for broken HTML.
http://codespeak.net/lxml/dev/parsing.html#parsing-html
I stand corrected, I missed that whole part of the LXML documentation :-)
I see there are a couple of tools I could use, and I have also heard of
sgmllib and htmllib. So now there are lxml, Beautiful Soup, sgmllib,
htmllib ...
Is there one of those tools that does the job I need more easily than
the others, and which should I use? Maybe a combination of them; which
one is better for which part of the work?

se****@gmail.com wrote:
I see there is a couple of tools I could use, and I also heard of sgmllib and htmllib. [...]
Well, as I said, use lxml. It's fast, pythonically easy to use, extremely
powerful and extensible. Apart from being its main author :), I actually use
it for lots of small tasks more or less like yours. It's just plain great
for a quick script that gets you from A to B for a bag of documents.
Parse the documents in with the HTML parser (even straight from URLs), then
use XPath to extract (exactly) what you want, and then work on it as you
wish. That's short and simple in lxml.
http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath
Stefan