text representation of HTML

Hi,

I am looking for a library that will give me a very simple text
representation of HTML.
For example,

<div><h1>Title</h1><p>This is a <br />test</p></div>

will be transformed to:

Title

This is a
test

I want to send a plain-text alternative of an HTML email, and would prefer
to generate it automatically from the HTML source.
Any hints?

Thanks!
Ksenia.
Jul 19 '06 #1
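
For context on the goal stated above, here is a minimal sketch of how a plain-text alternative is usually attached next to the HTML body, using the standard library's email.message.EmailMessage in Python 3. The function name build_alternative_message and the example addresses are made up for illustration; the text body is assumed to come from whichever HTML-to-text converter is chosen.

```python
from email.message import EmailMessage


def build_alternative_message(subject, sender, recipient, text_body, html_body):
    """Build a multipart/alternative message (hypothetical helper)."""
    message = EmailMessage()
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = recipient
    # The plain-text part is set first; clients that cannot render
    # HTML fall back to it.
    message.set_content(text_body)
    # add_alternative() turns the message into multipart/alternative
    # and appends the HTML part after the plain-text part.
    message.add_alternative(html_body, subtype="html")
    return message


# Usage (addresses are placeholders):
# msg = build_alternative_message(
#     "Test", "me@example.com", "you@example.com",
#     "Title\n\nThis is a\ntest\n",
#     "<div><h1>Title</h1><p>This is a <br />test</p></div>")
```

Putting the plain-text part first matters because mail clients generally prefer the last alternative they are able to display.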


Ksenia Marasanova wrote:
> Hi,
>
> I am looking for a library that will give me a very simple text
> representation of HTML.
> For example,
>
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
> I want to send a plain-text alternative of an HTML email, and would prefer
> to do it automatically from the HTML source.
> Any hints?

html2text is a command-line tool. You can invoke it from Python using
the subprocess module.

Diez
Jul 19 '06 #2
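
The subprocess suggestion above could look roughly like the following in Python 3. This is a sketch under assumptions: it presumes an executable named html2text is on the PATH and that it reads HTML from standard input and writes text to standard output; the exact flags and behaviour depend on which html2text is installed. The function name html_to_text_via_cli and its command parameter are made up for illustration.

```python
import subprocess


def html_to_text_via_cli(html, command=("html2text",)):
    """Pipe HTML through an external converter and return its output.

    `command` is an assumption: any program that reads HTML on stdin
    and writes plain text on stdout will work here.
    """
    completed = subprocess.run(
        list(command),
        input=html,            # sent to the child's stdin
        capture_output=True,   # collect stdout and stderr
        text=True,             # str in, str out
        check=True,            # raise CalledProcessError on failure
    )
    return completed.stdout


# Usage (requires an html2text executable on the PATH):
# print(html_to_text_via_cli(
#     "<div><h1>Title</h1><p>This is a <br />test</p></div>"))
```

Keeping the command as a parameter makes it easy to swap in a different converter, or to pass extra options, without touching the function.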

Hi,

I guess stripogram would be more Pythonic:
http://sourceforge.net/project/showf...?group_id=1083

Regards,

Laurent

Jul 19 '06 #3

Ksenia Marasanova <ks***************@gmail.com> wrote:
> Hi,
>
> I am looking for a library that will give me a very simple text
> representation of HTML.
> For example,
>
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
> I want to send a plain-text alternative of an HTML email, and would prefer
> to do it automatically from the HTML source.

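Something like this:

import re
text = '<div><h1>Title</h1><p>This is a <br />test</p></div>'
# collapse runs of whitespace into a single space
text = re.sub(r'[\n \t]+', ' ', text)
# start a new line at <p>, <br>, <br /> and <h1>..<h6>
text = re.sub(r'(?i)<p>|<br\s*/?>|<h[1-6]>', '\n', text)
# drop all remaining tags
result = re.sub(r'<.+?>', '', text)
print(result)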

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Jul 20 '06 #4

Ksenia Marasanova wrote:
> I am looking for a library that will give me a very simple text
> representation of HTML.
> For example,
>
> <div><h1>Title</h1><p>This is a <br />test</p></div>
>
> will be transformed to:
>
> Title
>
> This is a
> test
>
> I want to send a plain-text alternative of an HTML email, and would prefer
> to do it automatically from the HTML source.
> Any hints?

Use htmllib:

>>> import htmllib, formatter, StringIO
>>> def cleanup(s):
...     out = StringIO.StringIO()
...     p = htmllib.HTMLParser(
...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
...     p.feed(s)
...     p.close()
...     if p.anchorlist:
...         print >>out
...     for idx, anchor in enumerate(p.anchorlist):
...         print >>out, "\n[%d]: %s" % (idx+1, anchor)
...     return out.getvalue()
...
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br
... />test</p></div>''')

Title

This is a
test
>>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
... href="http://python.org">a link</a> to the Python homepage</p></div>''')

Title

This is a
test with a link[1] to the Python homepage

[1]: http://python.org
Jul 20 '06 #5
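
The htmllib module used above exists only in Python 2 (it was dropped in Python 3.0, and formatter was removed in 3.10). As a rough Python 3 counterpart, the same idea can be sketched with html.parser from the standard library. This is not a port of htmllib's formatter: the class name TextWithLinks, the set of tags treated as line breaks, and the whitespace handling are all choices made for this illustration, and blank lines between blocks are dropped rather than preserved.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect text and number each link like a footnote (illustrative)."""

    # Closing one of these tags ends the current line (an assumption;
    # extend as needed).
    LINE_ENDING_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.links = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")
        elif tag == "a":
            href = dict(attrs).get("href")
            self._in_link = bool(href)
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Mark the link text with its footnote number.
            self.chunks.append("[%d]" % len(self.links))
            self._in_link = False
        elif tag in self.LINE_ENDING_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def cleanup(html):
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    # Normalise whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    text = "\n".join(line for line in lines if line)
    if parser.links:
        notes = ("[%d]: %s" % (i, url) for i, url in enumerate(parser.links, 1))
        text += "\n\n" + "\n".join(notes)
    return text
```

For the second example in the post, this sketch returns "Title", "This is a" and "test with a link[1] to the Python homepage" on separate lines, followed by a blank line and "[1]: http://python.org".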

On 20 Jul 2006 15:12:27 GMT, Duncan Booth <du**********@invalid.invalid> wrote:
> Ksenia Marasanova wrote:
>> I want to send a plain-text alternative of an HTML email, and would prefer
>> to do it automatically from the HTML source.
>> Any hints?
>
> Use htmllib:
>
> >>> import htmllib, formatter, StringIO
> >>> def cleanup(s):
> ...     out = StringIO.StringIO()
> ...     p = htmllib.HTMLParser(
> ...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
> ...     p.feed(s)
> ...     p.close()
> ...     if p.anchorlist:
> ...         print >>out
> ...     for idx, anchor in enumerate(p.anchorlist):
> ...         print >>out, "\n[%d]: %s" % (idx+1, anchor)
> ...     return out.getvalue()
> ...
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br
> ... />test</p></div>''')
>
> Title
>
> This is a
> test
> >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
> ... href="http://python.org">a link</a> to the Python homepage</p></div>''')
>
> Title
>
> This is a
> test with a link[1] to the Python homepage
>
> [1]: http://python.org

cleanup() doesn't handle scripts and styles too well. html2text will
do a much better job with these and gives more structured output
(compatible with Markdown):

http://www.aaronsw.com/2002/html2text/

>>> import html2text
>>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br
... />test with <a href="http://python.org">a link</a> to the Python
... homepage</p></div>''')

# Title

This is a
test with [a link][1] to the Python homepage

[1]: http://python.org

HTH :)
Jul 20 '06 #6
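
To illustrate the script/style problem mentioned above: a converter has to throw away the contents of those elements, not just the tags around them. The following Python 3 sketch shows one way to do that with the standard library's html.parser. It is not how html2text works internally; the names VisibleText and visible_text are made up for this example, and the output is flattened to a single line for simplicity.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text while ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse all whitespace runs into single spaces.
    return " ".join("".join(parser.parts).split())
```

The same depth counter could be added to any of the parser-based approaches in this thread; the regular-expression approach would need a separate pass to remove whole script and style elements first.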

Sorry for the late reply... better late than never :)
Thanks to all for the tips. Stripogram is the winner, since it is the
most configurable and accepts a line-length parameter, which is handy for
email...

Ksenia.

Sep 21 '06 #7
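
On the line-length point: if the converter in use has no such option, the plain text can be wrapped afterwards with the standard library's textwrap module. This is a small Python 3 sketch of that idea, not stripogram's API; the function name wrap_for_email and the default width of 72 are assumptions for illustration.

```python
import textwrap


def wrap_for_email(text, width=72):
    """Wrap each line to `width` columns, keeping existing line breaks."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.wrap() returns [] for an empty line, so blank lines
        # are kept explicitly to preserve paragraph spacing.
        wrapped_lines.extend(textwrap.wrap(line, width=width) or [""])
    return "\n".join(wrapped_lines)


# Usage:
# print(wrap_for_email("Title\n\nThis is a test of wrapping", width=10))
```

Wrapping line by line, rather than re-filling whole paragraphs, keeps the breaks that came from <br /> in the original HTML.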
