Bytes IT Community

Mozilla HTML parser in Python?

Hello,

I'm working on a machine learning project for which I have lots of HTML news stories. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that the pages I'm using are being tidied by Mozilla. For example, <p/> is changed into proper <p> and </p> tags.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them just ignore the <p/> shorthand.

Does anyone know of a Python parser module that makes the same changes to input HTML that Mozilla makes when it displays a page?

Thanks,

Nico
Jan 9 '08 #1
1 Reply


There is no <p /> shorthand in HTML; the self-closing form is only meaningful in XML/XHTML.
In all things, you are better off using standards-compliant source.

If you are trying to add some extra information, you could try using XHTML and namespaced tags, e.g.:

<custom:p> </custom:p>
<custom:p />

This way, HTML parsers should hopefully ignore the namespaced tags.

Ugly hack though.
I'm sure there is a better way to do what you are attempting.
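
For what it's worth, here is a rough sketch of how the prefixed-tag idea could be read back in Python without going through a browser. It uses only the standard library's html.parser and simply treats any tag whose name starts with "custom:" as a hand-added label. The class name, the function name, and the label names in the usage example (custom:headline, custom:person) are made up for illustration, and this does not try to reproduce Mozilla's tidying.

```python
from html.parser import HTMLParser


class LabelCollector(HTMLParser):
    """Collect the text found inside hand-added, prefixed label tags."""

    def __init__(self, prefix="custom:"):
        super().__init__()
        self.prefix = prefix    # assumed prefix marking hand-added tags
        self.open_labels = []   # stack of label tags currently open
        self.labelled = []      # (label tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        # Only prefixed tags are tracked; ordinary HTML tags are ignored.
        if tag.startswith(self.prefix):
            self.open_labels.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost label if this end tag matches it.
        if self.open_labels and self.open_labels[-1] == tag:
            self.open_labels.pop()

    def handle_data(self, data):
        # Record non-empty text that appears inside an open label.
        text = data.strip()
        if text and self.open_labels:
            self.labelled.append((self.open_labels[-1], text))


def collect_labels(html):
    """Return (label tag, text) pairs for every prefixed tag in html."""
    parser = LabelCollector()
    parser.feed(html)
    parser.close()
    return parser.labelled
```

Usage would look something like this:

```python
page = ('<p>Intro</p><custom:headline>Big story</custom:headline>'
        '<p>Body <custom:person>Nico</custom:person> text</p>')
print(collect_labels(page))
```

Because the ordinary tags are never interpreted, differences in how a given parser tidies <p> elements should not affect which labels are found, although that is an assumption worth checking against real pages.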
Jan 12 '08 #2
