On Oct 30, 6:44 pm, "一首诗" <newpt...@gmail.com> wrote:
Oh, I didn't make myself clear.
>
What I mean is how to convert a piece of HTML to plain text but keep as
much formatting as possible.
>
Such as converting "&nbsp;" to a blank space and "<br>" to "\r\n".
>
Then you can explore the HTMLParser module,
http://docs.python.org/lib/module-HTMLParser.html, for example:
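#!/usr/bin/env python
# Python 2 code: the HTMLParser module was renamed html.parser in Python 3.
from HTMLParser import HTMLParser

parsedtext = ''

class Parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Turn each <br> into a CRLF line break.
        global parsedtext
        if tag == 'br':
            parsedtext += '\r\n'

    def handle_data(self, data):
        # Plain text between tags is copied through unchanged.
        global parsedtext
        parsedtext += data

    def handle_entityref(self, name):
        # Turn &nbsp; into an ordinary space.
        global parsedtext
        if name == 'nbsp':
            parsedtext += ' '

x = Parser()
x.feed('An&nbsp;text<br>')
print parsedtext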
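
For anyone on Python 3, here is a rough sketch of the same idea without the global variable. The names TextExtractor and html_to_text are made up for this example. It assumes the default convert_charrefs=True behaviour of html.parser, where entities are decoded before handle_data() is called, so &nbsp; shows up as the character U+00A0 instead of going through handle_entityref().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from HTML, mapping <br> to CRLF and &nbsp; to a space."""

    def __init__(self):
        # With convert_charrefs=True the parser decodes entities such as
        # &nbsp; into characters before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br/> also ends up here, via the default handle_startendtag().
        if tag == 'br':
            self.parts.append('\r\n')

    def handle_data(self, data):
        # A decoded &nbsp; arrives as U+00A0; map it to a plain space.
        self.parts.append(data.replace('\xa0', ' '))

    def get_text(self):
        return ''.join(self.parts)


def html_to_text(html):
    # Hypothetical helper name; feeds one complete fragment and returns text.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.get_text()


print(repr(html_to_text('An&nbsp;text<br>')))  # prints 'An text\r\n'
```

Collecting pieces in a list and joining them at the end keeps the state on the parser instance, so several documents can be converted independently.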
Gary Herron wrote:
一首诗 wrote:
Is there any simple way to solve this problem?
>
Yes, strings have a replace method:
>
>>> s = "abc&nbsp;def"
>>> s.replace('&nbsp;', ' ')
'abc def'
>
Also, various modules meant for dealing with web pages, XML, and similar
content have functions for such operations.
Gary Herron
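
As one possible illustration of the replace-based approach on Python 3, the sketch below combines str.replace with the standard library's html.unescape. The function name entities_to_text is invented for this example, and the regular expression only recognises simple <br> tags; it is not a general HTML parser.

```python
import html
import re


def entities_to_text(fragment):
    # Replace <br>, <br/> and <br /> (any letter case) with CRLF.
    text = re.sub(r'(?i)<br\s*/?>', '\r\n', fragment)
    # html.unescape() decodes named and numeric character references.
    text = html.unescape(text)
    # &nbsp; decodes to U+00A0; turn it into an ordinary space.
    return text.replace('\xa0', ' ')


print(repr(entities_to_text('abc&nbsp;def<br>x &amp; y')))
```

This is fine for small, predictable fragments; for arbitrary HTML a real parser is the safer route.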