Mozilla HTML parser in Python?


I'm working on a machine learning project for which I have a lot of HTML news stories. I built an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that Mozilla tidies the pages as it loads them. For example, the shorthand <p/> is changed into a proper <p> and </p> pair.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them simply ignore the <p/> shorthand.

Does anyone know of a Python parser module that makes the same changes to input HTML that Mozilla makes when it displays a page?
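
To make the <p/> case concrete, here is a minimal sketch of the kind of rewrite I mean, using only the standard library's html.parser. It is not Mozilla's algorithm and does none of its other tidying; it only expands self-closing shorthand on non-void elements into an open/close pair. The names SelfClosingExpander and expand_self_closing are my own, made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# HTML void elements: these never take a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class SelfClosingExpander(HTMLParser):
    """Hypothetical helper: re-serializes HTML, expanding <tag/> shorthand."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs):
        # HTMLParser hands over lowercased names and unescaped values.
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        return f"<{tag}{rendered}>"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style shorthand such as <p/> or <br/>.
        self.parts.append(self._open_tag(tag, attrs))
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def expand_self_closing(html_text):
    """Return html_text with self-closing shorthand expanded (sketch only)."""
    parser = SelfClosingExpander()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(expand_self_closing("<p/>Hello<br/>"))
```

This only covers the one rewrite described above. It does not repair unclosed or misnested tags the way a browser does, so its output will still differ from what Mozilla produces on messy pages.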


Jan 9 '08 #1
