robin <ro*********@gmail.com> wrote:
hi,
i remember seeing this simple python function which would take raw html
and output the content (body?) of the page as plain text (no <..> tags
etc)
i have been looking at htmllib and htmlparser but this all seems to
complicated for what i'm looking for. i just need the main text in the
body of some arbitrary webbpage to then do some natural-language
processing with it...
thanks for pointing me to some helpful resources!
text=re.sub(r'(?s)\<.+?\>', '', html_text)
(this will keep html entities, though)
--
-----------------------------------------------------------
| Radovan GarabÃ*k
http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!