html 2 plain text

robin

hi,
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and htmlparser but this all seems too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

robin

May 28 '06 #1

Faber

robin wrote:

> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!

Have a look at the Beautiful Soup library:
http://www.crummy.com/software/BeautifulSoup/

Regards

--
Faber
http://faberbox.com/
http://smarking.com/

A teacher must always teach to doubt his teaching. -- José Ortega y Gasset

May 28 '06 #2

robin

looks yummy. thanks a lot.

robin

May 28 '06 #3

Ravi Teja

> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)

http://www.aaronsw.com/2002/html2text/

May 28 '06 #4

garabik-news-2005-05

robin <ro*********@gmail.com> wrote:

> hi,
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!

import re
text = re.sub(r'(?s)\<.+?\>', '', html_text)
(this will keep html entities, though)
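
To make the caveat concrete, here is a small sketch of that one-liner under Python 3. The sample string in `html_text` is made up for illustration; only the pattern comes from the post.

```python
import re

# Hypothetical sample input, not taken from the thread.
html_text = "<p>Fish &amp; chips,\n<b>only</b> &pound;5</p>"

# (?s) lets "." match newlines, so tags that span several lines go too.
# The non-greedy ".+?" stops at the first ">" after each "<".
text = re.sub(r'(?s)\<.+?\>', '', html_text)

print(text)
```

The tags are gone, but `&amp;` and `&pound;` are left in the result as-is, which is the limitation noted above.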

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

May 29 '06 #5

Fredrik Lundh

ga******************@kassiopeia.juls.savba.sk wrote:

> text=re.sub(r'(?s)\<.+?\>', '', html_text)
> (this will keep html entities, though)

here's a variation that handles that too:

http://effbot.org/zone/re-sub.htm#strip-html
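
The linked page is not reproduced here. As a rough sketch of the same idea in Python 3, assuming the standard library's `html.unescape` is good enough for decoding entities, one could strip the tags first and decode afterwards. The function name `strip_html` and the sample string are illustrative only.

```python
import html
import re

def strip_html(html_text):
    """Remove tags, then decode entities such as &amp; or &#233;."""
    # Strip tags first: decoding first would turn "&lt;b&gt;" into "<b>",
    # which the tag pattern would then delete as if it were markup.
    without_tags = re.sub(r'(?s)<.+?>', '', html_text)
    return html.unescape(without_tags)

print(strip_html("<p>Fish &amp; chips: 1 &lt; 2, caf&#233;</p>"))
```

The order of the two steps is the main design choice: decoding last keeps escaped angle brackets in the text instead of mistaking them for tags.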

</F>

May 29 '06 #6
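
For the use case in the original question (body text for natural-language processing), a regex that only removes tags also keeps whatever sits inside script and style elements. A minimal sketch with Python 3's `html.parser.HTMLParser`, the modern home of the htmlparser module robin mentions, could skip those. The names `TextExtractor` and `html_to_text` are made up for this example, and it is a sketch rather than a robust extractor.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True means entities arrive already decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join with spaces so adjacent blocks do not run together,
    # then collapse the leftover runs of whitespace.
    return " ".join(" ".join(parser.parts).split())

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<p>Fish &amp; chips</p><script>var x = 1;</script>"
        "<p>Done</p></body></html>")
print(html_to_text(page))
```

Joining the pieces with spaces keeps paragraphs apart, at the cost of splitting a word that has inline markup in the middle of it.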
