Mozilla HTML parser in Python?

Hello,

I'm working on a machine learning project for which I have lots of HTML news stories. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that Mozilla tidies the pages I'm using. For example, <p/> is changed into proper <p> and </p> tags.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them just ignore the <p/> shorthand.

Does anyone know of a Python parser module that makes the same changes to input HTML that Mozilla makes to the pages it displays?

Thanks,

Nico
Jan 9 '08 #1
garrow
There is no <p /> shorthand in HTML; the self-closing form is only meaningful in XML/XHTML.
In all things you are better off using standards-compliant source.
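
If you want to normalise the source yourself, here is a minimal sketch using only the standard library's html.parser. It rewrites self-closing non-void tags such as <p/> into explicit <p></p> pairs. The class and function names (ExpandSelfClosing, expand_self_closing) are made up for this example, and this is not a reimplementation of Mozilla's parsing rules, only one way to get explicit open/close tags out of Python. Processing instructions are dropped, and attribute quoting is rewritten to double quotes.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are emitted as a lone start tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ExpandSelfClosing(HTMLParser):
    """Re-emit markup, writing <p/> as <p></p> (hypothetical helper)."""

    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") are kept bare.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._start(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Called for the self-closing form, e.g. <p/> or <br />.
        self.out.append(self._start(tag, attrs))
        if tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def expand_self_closing(html_text):
    parser = ExpandSelfClosing()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling expand_self_closing("<p/>Hello<br/>") gives back the same markup with the <p/> written as an explicit pair and the <br/> left as a single void tag.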

If you are trying to add some extra information, you could try using XHTML and namespaced tags, e.g.

<custom:p> </custom:p>
<custom:p />

This way, HTML parsers should hopefully ignore the namespaced tags.
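
As a rough sketch of how that could look on the Python side, assuming the labels use the custom: prefix shown above: the standard library's html.parser reports a tag like <custom:p> with the name "custom:p", so you can pick the labels out by prefix and keep the plain text separately. LabelStripper and extract_labels are hypothetical names for this example.

```python
from html.parser import HTMLParser


class LabelStripper(HTMLParser):
    """Collect text inside prefixed label tags (hypothetical helper)."""

    def __init__(self, prefix="custom:"):
        super().__init__()
        self.prefix = prefix
        self.text = []      # all character data, labels removed
        self.labelled = []  # (tag name, enclosed text) pairs
        self._open = []     # stack of (tag name, text chunks) for open labels

    def handle_starttag(self, tag, attrs):
        if tag.startswith(self.prefix):
            self._open.append((tag, []))

    def handle_endtag(self, tag):
        # Only close a label if it matches the innermost open one.
        if self._open and self._open[-1][0] == tag:
            name, chunks = self._open.pop()
            self.labelled.append((name, "".join(chunks)))

    def handle_data(self, data):
        # Text belongs to every label that is currently open.
        for _, chunks in self._open:
            chunks.append(data)
        self.text.append(data)


def extract_labels(html_text, prefix="custom:"):
    parser = LabelStripper(prefix)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.text), parser.labelled
```

The self-closing <custom:p /> form is handled too: the parser's default handle_startendtag calls the start and end handlers back to back, so it shows up as a label with empty text.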

Ugly hack though.
I'm sure there is a better way to do what you are attempting.
Jan 12 '08 #2
