text representation of HTML

Hi,

I am looking for a library that will give me very simple text
representation of HTML.
For example
<div><h1>Title</h1><p>This is a <br />test</p></div>

will be transformed to:

Title

This is a
test

I want to send a plain-text alternative with an HTML email, and would
prefer to generate it automatically from the HTML source.
Any hints?

Thanks!
Ksenia.
Jul 19 '06 #1
html2text is a command-line tool. You can invoke it from Python using
subprocess.
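
A minimal sketch of that idea for a current Python 3, assuming an executable named html2text is on the PATH and that it reads HTML from standard input when no file is given (check your version's man page; both are assumptions). The function name html_to_text_cli is made up for illustration.

```python
import subprocess


def html_to_text_cli(html, command=("html2text",)):
    """Pipe `html` to an external converter on stdin and return its stdout.

    `command` is an assumption: replace it with whatever converter
    (and flags) is actually installed on your system.
    """
    completed = subprocess.run(
        list(command),
        input=html,           # sent to the child process on stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout
```

Calling html_to_text_cli('<h1>Title</h1>') would then return whatever the tool prints. Keeping the command as a parameter makes it easy to swap in another converter.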

Diez
Jul 19 '06 #2
Hi,

I guess stripogram would be more Pythonic:
http://sourceforge.net/project/showf...?group_id=1083

Regards,

Laurent

Jul 19 '06 #3
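Something like this:

import re
text = '<div><h1>Title</h1><p>This is a <br />test</p></div>'
# collapse runs of whitespace into a single space
text = re.sub(r'[\n \t]+', ' ', text)
# turn opening <p>, <br> (including <br />) and <h1>..<h6> tags into newlines
text = re.sub(r'(?i)<(p|br|h[1-6])\b[^>]*>', '\n', text)
# drop every remaining tag
result = re.sub(r'<.+?>', '', text)
print(result)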
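
One thing a purely regex-based approach leaves behind is character entities such as &amp;. A small sketch of the same idea wrapped in a function, with html.unescape from the Python 3 standard library added as a last step. The name strip_tags is made up, and this is still only suitable for simple, well-behaved markup:

```python
import html
import re


def strip_tags(markup):
    """Rough HTML-to-text conversion using regular expressions only."""
    # Collapse all runs of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", markup)
    # Opening <p>, <br>, <br /> and <h1>..<h6> tags become line breaks.
    text = re.sub(r"(?i)<(p|br|h[1-6])\b[^>]*>", "\n", text)
    # Remove every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim leading/trailing whitespace.
    return html.unescape(text).strip()
```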

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Jul 20 '06 #4
Use htmllib:
>>> import htmllib, formatter, StringIO
>>> def cleanup(s):
...     out = StringIO.StringIO()
...     p = htmllib.HTMLParser(
...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
...     p.feed(s)
...     p.close()
...     if p.anchorlist:
...         print >>out
...     for idx,anchor in enumerate(p.anchorlist):
...         print >>out, "\n[%d]: %s" % (idx+1,anchor)
...     return out.getvalue()
...
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br
... />test</p></div>''')

Title

This is a
test
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
... href="http://python.org">a link</a> to the Python homepage</p></div>''')

Title

This is a
test with a link[1] to the Python homepage

[1]: http://python.org
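
The session above is Python 2; htmllib does not exist in Python 3. A rough sketch of the same idea on top of html.parser from the Python 3 standard library follows. The names TextExtractor and html_to_text and the choice of block-level tags are assumptions made for illustration, not part of any library, and the sketch also skips the contents of <script> and <style> elements.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, paragraph breaks and numbered link references."""

    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li"}
    SKIP_TAGS = {"script", "style"}  # contents are dropped entirely

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []     # pieces of output text
        self.links = []     # href values, in order of appearance
        self._pending = []  # hrefs of <a> elements that are still open
        self._skip = 0      # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a":
            self._pending.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a" and self._pending:
            href = self._pending.pop()
            if href:
                self.links.append(href)
                self.parts.append("[%d]" % len(self.links))

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(markup):
    """Return a plain-text rendering of `markup` with a link list appended."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)    # trim spaces around breaks
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # at most one blank line
    if parser.links:
        refs = ("[%d]: %s" % (i, url) for i, url in enumerate(parser.links, 1))
        text += "\n\n" + "\n".join(refs)
    return text
```

For the second example input above, print(html_to_text(...)) should give the same text and link list as the htmllib version, minus the leading blank line.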
Jul 20 '06 #5
On 20 Jul 2006 15:12:27 GMT, Duncan Booth <du**********@invalid.invalid> wrote:
> Use htmllib:

cleanup() doesn't handle scripts and styles too well. html2text does
a much better job with these and gives more structured output
(compatible with Markdown):

http://www.aaronsw.com/2002/html2text/
>>> import html2text
>>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br
... />test with <a href="http://python.org">a link</a> to the Python
... homepage</p></div>''')

# Title

This is a
test with [a link][1] to the Python homepage

[1]: http://python.org
HTH :)
Jul 20 '06 #6
Sorry for the late reply... better late than never :)
Thanks to all for the tips. Stripogram is the winner, since it is the
most configurable and accepts a line-length parameter, which is handy for
email...
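
For anyone landing here with the same goal, a sketch of the final step in Python 3: wrapping the plain text to a fixed line length and attaching it alongside the HTML as a multipart/alternative message. It uses only the standard library (textwrap and email.message); the function name build_message, the default width of 72 and the example addresses are assumptions, and the plain text is expected to come from whichever converter you picked.

```python
import textwrap
from email.message import EmailMessage


def build_message(subject, sender, recipient, html_body, text_body, width=72):
    """Build a multipart/alternative email with text and HTML parts."""
    # Wrap each non-empty line separately so paragraph breaks survive.
    wrapped = "\n".join(
        textwrap.fill(line, width=width) if line.strip() else ""
        for line in text_body.splitlines()
    )
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(wrapped)                        # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg
```

The returned object can be handed to smtplib.SMTP.send_message(). Mail clients that cannot or will not render HTML then fall back to the text part.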

Ksenia.

Sep 21 '06 #7
