Hello,
I'm working on a machine learning project for which I have a large collection of HTML news stories. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.
The problem I've found is that Mozilla tidies the pages when it loads them. For example, <p/> is changed into proper <p> and </p> tags.
I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them just ignore the <p/> shorthand.
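To illustrate the shorthand issue, here is a minimal sketch using the standard library's html.parser module. The TagLogger class, the tag_events helper and the sample string are names and data made up for this example; it only logs the events the parser reports and does not try to reproduce Mozilla's behaviour.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record the events the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for the XML-style shorthand, e.g. <p/>.
        self.events.append(("startend", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log readable.
        if data.strip():
            self.events.append(("data", data.strip()))


def tag_events(html):
    """Parse an HTML string and return the list of logged events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# Invented sample input using the <p/> shorthand.
sample = "<p/>First paragraph<p/>Second paragraph"
events = tag_events(sample)
for event in events:
    print(event)
```

With this parser each <p/> is reported as a single empty element through handle_startendtag and the text sits outside it, whereas Mozilla gives me separate <p> and </p> tags.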
Does anyone know of a Python parser module that makes the same changes to the input HTML that Mozilla makes when it displays a page?
Thanks,
Nico