Mozilla HTML parser in Python?


I'm working on a machine learning project for which I have lots of HTML news stories to work with. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that the pages I'm using are being tidied by Mozilla. For example, <p/> is changed into proper <p> and </p> tags.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried produce different HTML, and some of them just ignore the <p/> shorthand.

Does anyone know of a Python parser module that performs the same changes to input HTML that Mozilla makes to the pages it displays?


Jan 9 '08 #1
There is no <p /> shorthand in HTML; the self-closing form belongs to XHTML/XML.
In all things, you are better off using standards-compliant source.
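
For the <p/> case specifically, here is a rough sketch using Python's standard-library html.parser, which reports <p/> through handle_startendtag. It rewrites self-closing non-void tags as explicit open/close pairs. The names SelfClosingExpander and expand_self_closing are made up for illustration, and this is only one normalization; it does not reproduce everything Mozilla does to a page.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never take a closing tag in HTML, so they are left alone.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class SelfClosingExpander(HTMLParser):
    """Hypothetical helper: re-emit the input, expanding <tag/> to <tag></tag>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _format_attrs(self, attrs):
        out = []
        for name, value in attrs:
            if value is None:
                out.append(" " + name)  # bare attribute, e.g. "disabled"
            else:
                out.append(' %s="%s"' % (name, escape(value, quote=True)))
        return "".join(out)

    def handle_starttag(self, tag, attrs):
        self.parts.append("<%s%s>" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <tag/>.
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            self.handle_endtag(tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def expand_self_closing(html_text):
    parser = SelfClosingExpander()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Calling expand_self_closing("<p/>Hello") gives back the text with an explicit <p></p> pair. Note that html.parser lowercases tag names, so the output is not byte-for-byte identical to the input apart from the expansion.
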

If you are trying to add some extra information, you could try using XHTML and namespaced tags, e.g.

<custom:p> </custom:p>
<custom:p />

This way, HTML parsers should hopefully ignore the namespaced tags.

Ugly hack though.
I'm sure there is a better way to do what you are attempting.
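
To make the namespaced-tag idea a bit more concrete, here is a minimal sketch, again with the standard-library html.parser, that pulls the hand-added labels back out of a page. The "custom:" prefix comes from the example tags above; LabelCollector and collect_labels are hypothetical names, and how a given browser or other parser treats such tags is something you would need to check yourself.

```python
from html.parser import HTMLParser


class LabelCollector(HTMLParser):
    """Hypothetical helper: collect the text inside tags starting with a prefix."""

    def __init__(self, prefix="custom:"):
        super().__init__()
        self.prefix = prefix
        self.labels = []   # finished labels as (tag, text) pairs
        self._open = []    # currently open prefixed tags as (tag, text chunks)

    def handle_starttag(self, tag, attrs):
        if tag.startswith(self.prefix):
            self._open.append((tag, []))

    def handle_endtag(self, tag):
        # Only close the innermost open label, and only if the names match.
        if self._open and self._open[-1][0] == tag:
            name, chunks = self._open.pop()
            self.labels.append((name, "".join(chunks)))

    def handle_data(self, data):
        # Text belongs to every label that is currently open.
        for _, chunks in self._open:
            chunks.append(data)


def collect_labels(html_text, prefix="custom:"):
    parser = LabelCollector(prefix)
    parser.feed(html_text)
    parser.close()
    return parser.labels
```

A self-closing <custom:p /> is reported as a start tag followed immediately by an end tag, so it shows up as a label with empty text. Ordinary tags such as <p> are simply skipped by this collector.
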
Jan 12 '08 #2
