Pluggability of SAX parsers into DOM in JAXP

erik_midtskogen

Hi Folks,

I'm writing a general-purpose HTML screen-scraping framework in Java
(scrape new web sites without writing new code, yada yada...), and I
want to use the JAXP DOM api along with XPath and XSLT for most of my
business logic. I actually hope to make this an open-source project if
I can ever get it to some reasonable level of usability.

My problem is that, since the slurry pumped out by most web sites bears
only the faintest resemblance to HTML--let alone XML--I need to use a
special-purpose SAX parser that is intentionally not fully SAX
compliant (since it accepts malformed documents).

I already know how to set the system property for an arbitrary SAX
parser when programming to the SAX API (i.e. when calling
SAXParserFactory.newInstance()), and I also know how to specify an
arbitrary DocumentBuilderFactory when using DOM. So, how do I specify
the SAX parser that I want DOM to use "behind the scenes"?

My expectation was that the JAXP DOM implementation should be a client
of the JAXP SAX implementation. I could be wrong about this, though.
I'm looking at the code now, and although it's a bit hard to follow
(and my Eclipse debugger bugs out at just the wrong moment), it appears
as if the default JAXP DocumentBuilderFactory is hard-coded to use an
org.apache.xerces.parsers.XML11Configuration as a SAX parser. Weird.
I could be mistaken about this, but if it's true, then this is not my
idea of pluggability.

So here's where I am so far: I wrote a custom SAXParserFactory to
create an instance of my custom SAX parser, and I plugged it in and
tested it out using the SAX API and it worked just fine. But then when
I tried using the DOM API for my XPath/XSLT processing, specifying my
custom SAXParserFactory as before, I found that the JAXP DOM
implementation did not use the SAXParserFactory I had specified, and so
obviously, didn't use the SAX parser I wanted.
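
A minimal sketch of the behaviour described above, assuming a stock JDK. JAXP looks up the SAX and DOM factories under two separate system-property keys, so setting the SAX key does not by itself change what DocumentBuilderFactory.newInstance() returns. The class name FactoryLookupDemo and the commented-out com.example.TolerantSAXParserFactory are hypothetical, for illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

class FactoryLookupDemo {
    // The two JAXP lookup keys; they are resolved independently of each other.
    static final String SAX_KEY = "javax.xml.parsers.SAXParserFactory";
    static final String DOM_KEY = "javax.xml.parsers.DocumentBuilderFactory";

    /** Returns the implementation class names JAXP resolved for SAX and DOM. */
    static String[] resolvedFactories() {
        SAXParserFactory sax = SAXParserFactory.newInstance();
        DocumentBuilderFactory dom = DocumentBuilderFactory.newInstance();
        return new String[] { sax.getClass().getName(), dom.getClass().getName() };
    }

    public static void main(String[] args) {
        // Hypothetical custom factory; uncommenting this would affect the SAX lookup only.
        // System.setProperty(SAX_KEY, "com.example.TolerantSAXParserFactory");
        String[] names = resolvedFactories();
        System.out.println(SAX_KEY + " -> " + names[0]);
        System.out.println(DOM_KEY + " -> " + names[1]);
    }
}
```

Running it with and without the system property set shows whether the DOM side changes on a given JDK.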

I could try building my own DocumentBuilderFactory, but that looks like
an awful lot of work just to plug in a SAX parser. Does anyone here
know of an easier way?

Much thanks in advance.

Nov 29 '06 #1

Joseph Kesselman

er************* @anntaylor.com wrote:

I could try building my own DocumentBuilderFactory, but that looks like
an awful lot of work just to plug in a SAX parser. Does anyone here
know of an easier way?

There are many off-the-shelf construct-a-DOM-from-a-SAX-stream
implementations. Shouldn't be hard to find one if you do a bit of
websearching. Plug in a SAX parser and a generic DOM implementation and
push the button.
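
One way to do this with nothing but standard JAXP is the identity transform: wrap the SAX parser in a SAXSource and send the result to a DOMResult. The sketch below assumes the JDK's default XMLReader as a stand-in for the tolerant HTML parser; the class and method names (SaxToDom, toDom, parseString) are made up for the example. Note that the DOM builder behind the transformer is implementation-specific, so whether it creates namespace-aware nodes is not under the caller's control.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {
    /** Feeds the event stream of any SAX XMLReader into a new DOM Document. */
    static Document toDom(XMLReader reader, InputSource input) throws TransformerException {
        // newTransformer() with no stylesheet gives the identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult(); // empty result: the transformer creates the Document
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    /** Convenience wrapper using the JDK's default reader; swap in a custom XMLReader here. */
    static Document parseString(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            return toDom(reader, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseString("<html><body><p>hi</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```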

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden

Nov 29 '06 #2

erik_midtskogen

Thanks Joe,

Actually, I have tried SAX2DOM from the Xalan project. It works, but
this utility seems to want to add namespaces to my DOM, and can't turn
this feature off. Correct though the namespaces may be, they add
needless complexity to the required XPath expressions and XSLT files
that are used to configure the framework to scrape a site. I'm trying
to make my framework as easy to use as possible.

Also, I like the idea of sticking to the standard SAX and DOM APIs
because I want to keep my options as open as possible by programming to
interfaces instead of implementation classes. But if there is no easy
way of setting a system property to tell the standard JAXP DOM
implementation what SAX parser to use without making a big project out
of it, then I guess I'll go back to converting the SAX stream to a DOM
programmatically.

Thanks,
--Erik

Nov 29 '06 #3

Joseph Kesselman

er************* @anntaylor.com wrote:

Actually, I have tried SAX2DOM from the Xalan project. It works, but
this utility seems to want to add namespaces to my DOM, and can't turn
this feature off. Correct though the namespaces may be, they add
needless complexity to the required XPath expressions and XSLT files
that are used to configure the framework to scrape a site. I'm trying
to make my framework as easy to use as possible.

SAX2DOM shouldn't be adding namespaces unless the namespaces are present
in the SAX input -- in which case leaving them out is Absolutely
Incorrect; you'd be changing the meaning of the document (since the
namespaces are part of the document's semantics) and this bad practice
*WILL* eventually turn around and bite your kneecaps off.

Everything should be as simple as possible... but not simpler!

But if there is no easy
way of setting a system property to tell the standard JAXP DOM
implementation what SAX parser to use

The JAXP DOM path may not be using a SAX parser under the covers -- for
example, Xerces drives both SAX and DOM output off a lower-level
representation -- so there really isn't a plug-in point that maps to
what you're asking for. Using a separate SAX-driven DOM builder really
is likely to be the most portable solution. It's a pretty simple piece
of code, and since it's based entirely on the SAX and DOM specs it's
highly portable.
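
As a rough illustration of how small such a builder can be, here is a sketch of a SAX ContentHandler that assembles a DOM tree. It is not the Xalan SAX2DOM class, just a hand-written stand-in with hypothetical names (DomBuildingHandler, parseString). It creates namespaced nodes only when the SAX events actually carry a namespace URI, and it ignores comments, processing instructions and ignorable whitespace for brevity.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DomBuildingHandler extends DefaultHandler {
    private final Document doc;
    private final Deque<Node> open = new ArrayDeque<>(); // stack of currently open nodes

    DomBuildingHandler() throws ParserConfigurationException {
        // The DOM implementation still comes from JAXP; only the event source is custom.
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        open.push(doc);
    }

    Document getDocument() { return doc; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = qName.isEmpty() ? localName : qName;
        // Use a namespace only if the SAX stream reported one.
        Element e = uri.isEmpty() ? doc.createElement(name) : doc.createElementNS(uri, name);
        for (int i = 0; i < atts.getLength(); i++) {
            String an = atts.getQName(i).isEmpty() ? atts.getLocalName(i) : atts.getQName(i);
            if (atts.getURI(i).isEmpty()) e.setAttribute(an, atts.getValue(i));
            else e.setAttributeNS(atts.getURI(i), an, atts.getValue(i));
        }
        open.peek().appendChild(e);
        open.push(e);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A Document node cannot hold text, so stray text outside the root is skipped.
        if (open.peek() != doc) {
            open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    /** Parses a string with the JDK's default reader; a tolerant HTML reader would be plugged in here. */
    static Document parseString(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            DomBuildingHandler handler = new DomBuildingHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.getDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape, any namespaces that show up in the DOM can be traced back to the SAX parser's own events, which helps when working out where unexpected namespaces come from.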

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden

Nov 29 '06 #4

erik_midtskogen

Hi Joe,

OK, I guess I'll go back to programmatically performing the conversion
with a utility. I haven't yet figured out for sure where the
namespaces are actually coming from. I'll have to look into it.

While I agree with you that stripping namespaces out would have
problematic consequences if I were parsing general-purpose xml (and if
I cared about the element type in which a certain bit of data was
found), in this particular case it really is safe to ignore them
because of the nature of what I'm doing. I'm parsing html to scrape
out textual data. Namespaces aren't normally used in html--in fact,
not even in xhtml--to distinguish one element type from another. You
could conceivably use namespaces in xhtml, but there would be no
practical purpose in doing so. If you did so in a way that assigned an
element to a namespace other than http://www.w3c.org/TR/xhtml1 (or
something like that), no user agent would know what to do with it.

Even if namespaces were customarily used by web browsers to distinguish
between elements (such as might happen with inline SVG content), it
still might not make a difference to me because I don't actually care
what element type the data comes from. I'm really just using XPath and
XSLT as a more powerful alternative to fishing stuff out of the stream
using Perl scripting with regular expressions.

I'm generally pretty anal about this type of thing. Sloppiness and
ignorance in technical matters drive me crazy. It's one reason I hate
Microsoft. But in this case, it's more important to me that users of
my framework be able to write XPath expressions into the configuration
files without having to specify the same namespace prefix in all their
location steps. As long as I can write an XPath expression to identify
navigational elements and XSLT templates to scrape out the content, I'm
happy.

Thanks for your help.
--Erik

Nov 29 '06 #5

Joe Kesselman

er************* @anntaylor.com wrote:

Namespaces aren't normally used in html

HTML is based on SGML, which doesn't have the concept of namespaces.
XHTML is based on XML, which does.

could conceivably use namespaces in xhtml, but there would be no
practical purpose in doing so.

That's absolutely incorrect. Namespaces are essential when XHTML is
intermixed with other vocabularies -- MathML, SVG, and so on. That's
becoming more common.

For that reason, the XHTML elements themselves need to be in the correct
namespace (http://www.w3c.org/TR/xhtml1, as you pointed out).

Yes, it may not matter in your particular case. Or it may not matter
_yet_, which I submit is likely to be a more accurate statement unless
this is throw-away code.

But in this case, it's more important to me that users of
my framework be able to write XPath expressions into the configuration
files without having to specify the same namespace prefix in all their
location steps.

Alternative suggestion: Use an XPath 2.0/XSLT 2.0 implementation, where
the concept of default namespace is meaningful. That would let your
users leave out prefixes yet still get results which are completely
correct per the standards.
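
For contrast, here is a small sketch of the XPath 1.0 behaviour that motivates this, using only the javax.xml.xpath API bundled with the JDK. In XPath 1.0 an unprefixed name test matches only elements in no namespace, so XHTML elements have to be addressed through a prefix bound by a NamespaceContext. The class name XhtmlXPathDemo and the prefix "h" are arbitrary choices for the example. In XSLT 2.0 the corresponding knob is the xpath-default-namespace attribute; how an XPath 2.0 library exposes the default element namespace depends on the implementation.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Parses a string into a namespace-aware DOM. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates an XPath 1.0 expression as a string, with prefix "h" bound to XHTML. */
    static String eval(String expr, Document doc) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            xp.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });
            return xp.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html xmlns=\"" + XHTML_NS + "\"><body><p>hi</p></body></html>");
        // Unprefixed steps select nothing here, because the elements are in the XHTML namespace.
        System.out.println("[" + eval("/html/body/p", doc) + "]");
        // Prefixed steps select the paragraph.
        System.out.println("[" + eval("/h:html/h:body/h:p", doc) + "]");
    }
}
```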

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 29 '06 #6

Johannes Koch

Joe Kesselman schrieb:

For that reason, the XHTML elements themselves need to be in the correct
namespace (http://www.w3c.org/TR/xhtml1, as you pointed out).

The namespace URI for XHTML 1.x is <http://www.w3.org/1999/xhtml>.

--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)

Nov 30 '06 #7

Joe Kesselman

Johannes Koch wrote:

The namespace URI for XHTML 1.x is <http://www.w3.org/1999/xhtml>.

Blush. Yes. Sorry; copied that from the question and didn't stop to
recheck it. That's what I get for posting in a hurry...
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Nov 30 '06 #8

erik_midtskogen

Hi Joe,

Thank you for your clarifications and suggestion. I will definitely
look into using XPath/XSLT 2.0. I was looking for a way of
incorporating a default namespace into XPath expressions and XSLT
transforms, but was surprised to discover that this concept hadn't been
addressed in the previous version. It is definitely my preference to
have the capability of dealing with namespaces in my framework if this
can be done without making it harder to use for the 99% of the cases
where namespaces are irrelevant.

Thanks,
--Erik

Dec 1 '06 #9
