472,984 Members | 1,981 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,984 software developers and data experts.

Using sax libxml2 html parser

Hi all,

I have created an example using libxml2 based in the code that appears
in http://xmlsoft.org/python.html.
My example processes an enough amount of html files to see that the
memory consumption rises till the process ends (I check it with the
'top' command).

I don´t know if I am forgetting something in the code, as I have not
been able to find any example on the web.

Thanks in advance, Cesar

Note: I have also tried to put the cleanup functions inside the 'for'
loop.

****************************************] The Code
[****************************************

#!/usr/bin/python -u
import libxml2

#------------------------------------------------------------------------------
# Memory debug specific
libxml2.debugMemory(1)

#------------------------------------------------------------------------------

class callback:
def startDocument(self):
print "."

def endDocument(self):
pass

def startElement(self, tag, attrs):
pass

def endElement(self, tag):
pass

def characters(self, data):
pass

def warning(self, msg):
pass

def error(self, msg):
pass

def fatalError(self, msg):
pass

#------------------------------------------------------------------------------
#------------------------------------------------------------------------------

import os
import sys

programName = os.path.basename(sys.argv[0])

if len(sys.argv) != 2:
print "Use: %s <dir html files>" % programName
sys.exit(1)

inputPath = sys.argv[1]

if not os.path.exists (inputPath):
print "Error: directory does not exist"
sys.exit(1)

inputFileNames = []
dirContent = os.listdir(inputPath)
for fichero in dirContent:
extension1=fichero.rfind(".htm")
extension2=fichero.rfind(".html")
dot = fichero.rfind(".")
extension = max(extension1,extension2)
if extension != -1 and extension == dot:
inputFileNames.append (fichero)

if len(inputFileNames) == 0:
print "Error: no input files"
sys.exit(1)
handler = callback()
NUM_ITERS = 5
for i in range(NUM_ITERS):
for inputFileName in inputFileNames:
print inputFileName
inputFilePath = inputPath + inputFileName
f = open(inputFilePath)
data = f.read()
f.close()

ctxt = libxml2.htmlCreatePushParser(handler, "", 0, inputFileName)

ctxt.htmlParseChunk(data, len(data), 1)
ctxt = None
# Memory debug specific
libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
print "OK"
else:
print "Memory leak %d bytes" % (libxml2.debugMemory(1))
libxml2.dumpMemory()

# Other cleanup functions
#libxml2.cleanupCharEncodingHandlers()
#libxml2.cleanupEncodingAliases()
#libxml2.cleanupGlobals()
#libxml2.cleanupInputCallbacks()
#libxml2.cleanupOutputCallbacks()
#libxml2.cleanupPredefinedEntities()

Jan 5 '07 #1
1 4090
ce*********@gmail.com wrote:
I have created an example using libxml2 based in the code that appears
in http://xmlsoft.org/python.html.
My example processes an enough amount of html files to see that the
memory consumption rises till the process ends (I check it with the
'top' command).

Try the lxml binding for libxml2. It relieves you from having to do your own
memory management and makes using that library plain easy.

http://codespeak.net/lxml/

Stefan
Jan 6 '07 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: laura | last post by:
newbie... expath vs libxml2... libxml2 is an implementation of dom expath is a sax parser both comes with php installation (ie 4.3) both need to have php recompiled:
0
by: Juergen R. Plasser | last post by:
Hi, I have installed libxml2-2.5.7, libxslt-1.0.30 from source and the bindings libxml2-python-2.5.7. Everything seems to compile fine, but when I try to import the libxml2 library in python...
4
by: Lénaïc Huard | last post by:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello, I've some namespace problems when defining default values for attributes. My problem seems to come from the fact that the attributes are...
0
by: Richard Taylor | last post by:
User-Agent: OSXnews 2.07 Xref: number1.nntp.dca.giganews.com comp.lang.python:437315 Hi I am trying to use py2app (http://undefined.org/python/) to package a gnome-python application...
2
by: corley | last post by:
Hello world, I am having a problem with the decimal symbol. The result of writing a floating-point-number to an xml document (using libxml2 from xmlsoft.org) is <number>10,1234</number> ...
1
by: Andrew Marlow | last post by:
guys, I have been using libxml2 with python with no problems for just over a week but now I come to see if my script will work in someone else's environment and the libxml2 import fails. I am...
6
by: saumya.agarwal | last post by:
Hi, I am using libxml2 for xml parsing. When the client application sends data to libxml2 in UTF-8 format, it works fine. But, I have a scenarion in which the client application sends data to...
0
by: sharan | last post by:
how we can print the content of the element in a XML file using libxml2 example: this is my xml file <record> <firstname>myfile</firstname> <lastname>yourfile</lastname> </record> know i want...
2
by: jianbing.chen | last post by:
Hi, I have this weird situation where on the same machine(solaris 8, python 2.5), one user can do this with no problem: <module 'libxml2' from '/usr/local/lib/python2.5/site-packages/...
0
by: lllomh | last post by:
Define the method first this.state = { buttonBackgroundColor: 'green', isBlinking: false, // A new status is added to identify whether the button is blinking or not } autoStart=()=>{
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 4 Oct 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
0
tracyyun
by: tracyyun | last post by:
Hello everyone, I have a question and would like some advice on network connectivity. I have one computer connected to my router via WiFi, but I have two other computers that I want to be able to...
2
by: giovanniandrean | last post by:
The energy model is structured as follows and uses excel sheets to give input data: 1-Utility.py contains all the functions needed to calculate the variables and other minor things (mentions...
4
NeoPa
by: NeoPa | last post by:
Hello everyone. I find myself stuck trying to find the VBA way to get Access to create a PDF of the currently-selected (and open) object (Form or Report). I know it can be done by selecting :...
3
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be using a very simple database which has Form (clsForm) & Report (clsReport) classes that simply handle making the calling Form invisible until the Form, or all...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 1 Nov 2023 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM) Please note that the UK and Europe revert to winter time on...
3
by: nia12 | last post by:
Hi there, I am very new to Access so apologies if any of this is obvious/not clear. I am creating a data collection tool for health care employees to complete. It consists of a number of...
4
by: GKJR | last post by:
Does anyone have a recommendation to build a standalone application to replace an Access database? I have my bookkeeping software I developed in Access that I would like to make available to other...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.