Mozilla HTML parser in Python?

Hello,

I'm working on a machine learning project for which I have lots of HTML news stories to work with. I made an online tool that lets me view these pages in Mozilla and hand-label them by adding my own tags to the DOM.

The problem I've found is that Mozilla tidies the pages I'm using. For example, <p/> is changed into proper <p> and </p> tags.

I also want to be able to process these pages without going through Mozilla, but the Python HTML parsers I've tried output different HTML, and some of them just ignore the <p/> shorthand.
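
Here is a minimal sketch of the kind of mismatch I mean. It assumes Python 3's standard-library html.parser purely for illustration (it is not necessarily one of the parsers I tried), and the class name EventRecorder, the helper record_events, and the sample markup are all made up for this example. The parser reports <p/> as an empty element, so the text after it is not placed inside a paragraph:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record parse events so the handling of <p/> is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XML-style empty tags such as <p/>.
        self.events.append(("startend", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the event list short.
        if data.strip():
            self.events.append(("data", data.strip()))


def record_events(html):
    """Parse an HTML string and return the list of recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Hypothetical sample in the style of the news pages described above.
    sample = "<p/>First story paragraph<p/>Second story paragraph"
    for event in record_events(sample):
        print(event)
```

Each <p/> shows up as a single "startend" event followed by a separate "data" event, whereas the DOM I get back from Mozilla has the tidied <p> and </p> tags instead.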

Does anyone know of a Python parser module that makes the same changes to input HTML as Mozilla does when it displays a page?

Thanks,

Nico

Jan 9 '08 #1
