Mozilla HTML parser in Python?

Hello,

I'm working on a machine learning project for which I have lots of HTML news stories. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that the pages I'm using are being tidied by Mozilla. For example, the shorthand is changed into proper tags.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them just ignore the shorthand.

I was wondering if anyone knows of a parser module in Python that performs the same changes to input HTML as Mozilla makes to the pages it displays?
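One way to see how a given Python parser treats the shorthand is to record the events it emits. Below is a minimal sketch using only the standard-library `html.parser`. It assumes the shorthand is XML-style self-closing syntax such as `<p/>`, which is a guess because the original example did not survive. `EventRecorder` and `record_events` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record the parse events the standard-library parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XML-style shorthand such as <p/> (assumed input form).
        self.events.append(("startend", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the event list stays readable.
        if data.strip():
            self.events.append(("data", data.strip()))


def record_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


events = record_events("<p/>Some story text")
print(events)
```

Here the parser reports the shorthand through its own `handle_startendtag` callback and never opens a paragraph around the text that follows. Comparing this event list against the DOM the browser builds for the same input should show where the two disagree.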

Thanks,

Nico

Jan 9 '08 #1

garrow

There is no shorthand.
In all things you are better off using standards-compliant source.

If you are trying to add some extra information, you could try using XHTML and namespaced tags, e.g.

<custom:p> </custom:p>
<custom:p />

This way HTML parsers should hopefully ignore the namespaced tags.

Ugly hack though.
I'm sure there is a better way to do what you are attempting.
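
For what it's worth, the namespaced-tag idea could be sketched in Python roughly like this, using the standard-library `xml.etree.ElementTree`. The namespace URI `http://example.com/labels` and the sample document are made up for illustration, and this only works if the page is well-formed XHTML, which real news pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the hand-added labels; any unique URI works.
LABEL_NS = "http://example.com/labels"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample page: one ordinary paragraph and two label elements.
doc = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:custom="http://example.com/labels">
  <body>
    <p>Plain paragraph.</p>
    <custom:p>Hand-labelled paragraph.</custom:p>
    <custom:p />
  </body>
</html>"""

root = ET.fromstring(doc)

# ElementTree expands prefixes to {namespace-uri}localname, so the
# label elements and the ordinary XHTML elements never collide.
labelled = root.findall(".//{%s}p" % LABEL_NS)
plain = root.findall(".//{%s}p" % XHTML_NS)

labelled_text = [el.text for el in labelled]
plain_text = [el.text for el in plain]

print(labelled_text)
print(plain_text)
```

Code that only looks for XHTML elements never sees the label elements, and the labelling tool can pull them out by namespace.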

Jan 12 '08 #2
