html 2 plain text

hi,
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and htmlparser but this all seems too
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!

robin

May 28 '06 #1
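
For what it's worth, the parser route mentioned in the question does not have to be long. Below is a minimal sketch, assuming Python 3 (where the module is html.parser, which postdates this thread). The names TextExtractor and html_to_text are made up for illustration, and the choice to skip <script> and <style> contents is an assumption about what "main text" should mean:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True hands entities such as &amp; over already decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join pieces with a space so adjacent paragraphs do not run together,
    # then collapse the whitespace runs. A word with inline markup in the
    # middle of it would be split by this choice.
    return " ".join(" ".join(parser.parts).split())
```

Calling html_to_text(page_source) returns a single whitespace-normalised string, which is usually enough as input for tokenising.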
robin wrote:
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!


Have a look at the Beautiful Soup library:
http://www.crummy.com/software/BeautifulSoup/

Regards

--
Faber
http://faberbox.com/
http://smarking.com/

A teacher must always teach to doubt his teaching. -- José Ortega y Gasset
May 28 '06 #2
looks yummy. merci beaucoup.

robin

May 28 '06 #3
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)


http://www.aaronsw.com/2002/html2text/

May 28 '06 #4
robin <ro*********@gmail.com> wrote:
> hi,
> i remember seeing this simple python function which would take raw html
> and output the content (body?) of the page as plain text (no <..> tags
> etc)
> i have been looking at htmllib and htmlparser but this all seems too
> complicated for what i'm looking for. i just need the main text in the
> body of some arbitrary webpage to then do some natural-language
> processing with it...
> thanks for pointing me to some helpful resources!


import re
text = re.sub(r'(?s)\<.+?\>', '', html_text)
(this will keep html entities such as &amp;, though)

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
May 29 '06 #5
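
One caveat with the regex above: it removes the <script> and <style> tags themselves but leaves their contents in the output. A hedged sketch of one way around that is to drop those whole blocks first and strip the remaining tags afterwards. The names SCRIPT_STYLE, TAG and strip_tags are made up for illustration, and any regex approach will still trip over a bare < or > in text or attribute values:

```python
import re

# Whole <script>...</script> and <style>...</style> elements, contents included.
SCRIPT_STYLE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
# Any remaining tag; (?s) lets "." match newlines, so tags that span
# several lines are removed too.
TAG = re.compile(r"(?s)<.+?>")


def strip_tags(html_text):
    without_blocks = SCRIPT_STYLE.sub("", html_text)
    return TAG.sub("", without_blocks)
```

Entities are still left untouched here, exactly as with the original one-liner.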
ga******************@kassiopeia.juls.savba.sk wrote:
> text=re.sub(r'(?s)\<.+?\>', '', html_text)
> (this will keep html entities, though)


here's a variation that handles that too:

http://effbot.org/zone/re-sub.htm#strip-html

</F>

May 29 '06 #6
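
The code on the linked page is not reproduced here. As a rough sketch of the same idea in Python 3, the standard library's html.unescape can decode the entities once the tags are gone. The names strip_html and TAG are hypothetical, and the whitespace collapsing at the end is an extra assumption about what is wanted for text processing:

```python
import html
import re

TAG = re.compile(r"(?s)<.+?>")


def strip_html(html_text):
    # Strip tags first, so that text such as "&lt;b&gt;" is not mistaken
    # for markup after it has been decoded.
    no_tags = TAG.sub("", html_text)
    # Decode named and numeric character references (&amp;, &eacute;, &#233;).
    decoded = html.unescape(no_tags)
    # Collapse the runs of whitespace left where tags used to be.
    return " ".join(decoded.split())
```

The order of the two steps matters: decoding before stripping would turn an escaped "&lt;b&gt;" in the page text into something the tag regex then deletes.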
