Converting HTML to XHTML (JTidy,OpenXML,Xerces)

Hi,

After two weeks of searching and trial and error, I have finally decided
to turn to the group for a solution to my problem (something I should
have done much earlier).

This is the deal:

On a JSP page, I want to grab a URL, parse and change the HTML, and send
it back to the JSP page. I take the URL from the user in a textbox (not
the browser location box).

In the Java class (which I have imported in the JSP), I first tried the
Xerces parser, until I realised it only supports well-formed XML.

So I switched to OpenXML, which supports HTML. But it took about 10
minutes to parse and after that it also gave me an Out of Memory
Exception, even when I increased the buffer size of Tomcat to a good
amount and when I was parsing a page as simple as www.google.com.
If I don't use the DOCUMENT_HTML option in OpenXML and just treat
the HTML as a normal XML file, it does parse it properly (maybe it skips
the unterminated tags), but there's no way to return the XML to
the browser because doc.getDocumentElement().toString() returns '[html]
1 nodes'.

So then I switched to JTidy and tried to convert the HTML to XHTML. But
it seems the Document type returned by JTidy doesn't support most
standard document methods (including converting XML to a string using
doc.getDocumentElement().toString()), leaving me back where I started.

Can anybody suggest a good way to approach my problem?
All I want to do is grab a URL's HTML, add some tags to it (a
couple of appendChild() calls), and then send the HTML back to the user
to be displayed (interpreted) in the browser.

I'll be really thankful for your help!
Anupam

Mar 23 '06 #1
hi,

I did exactly the same thing with NekoHTML: parsing the HTML to XML,
then selecting some nodes with XPath, appending/replacing some nodes,
and transforming or serializing it back to HTML.
http://people.apache.org/~andyc/neko/doc/html/index.html
(a nice tool)

--------------------------------------------

Have you considered a full XML solution?

With Active Tags I used some tags/actions to achieve this. For this
purpose you could use RefleX on top of Tomcat:
http://reflex.gforge.inria.fr/
(a nice tool too)
RefleX comes with a servlet that can run Active Tags.

Your code would then look like this:
<web:service
    xmlns:web="http://www.inria.fr/xml/active-tags/web"
    xmlns:io="http://www.inria.fr/xml/active-tags/io"
    xmlns:xcl="http://www.inria.fr/xml/active-tags/xcl"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!--understand it as an HTTP service-->

  <!--things that are performed when the server starts-->
  <web:init>
    <!--share a stylesheet with all HTTP requests-->
    <xcl:parse-stylesheet name="ralyx.xsl"
        source="web:///WEB-INF/xslt/ralyx.xsl" scope="shared"/>
  </web:init>

  <!--map the URL path with a regexp-->
  <web:mapping
      match="^/(\d{4})/Fiches/([\p{Lower}\d\-_+]+)/\2\.(html|xml)$"
      method="GET" mime-type="">
    <!--use an HTML parser because the documents are not
        well-formed; <xcl:parse-html> uses NekoHTML-->
    <xcl:parse-html name="fiche"
        source="http://www.inria.fr/recherche/equipes/{ $web:match/node()[ 2 ] }.en.html"/>
    <xcl:set name="corps" value="{ $fiche//xhtml:DIV[@class='corps'] }"/>
    <xcl:set name="about" value="{ $corps/xhtml:TABLE//xhtml:TD[2] }"/>
    <xcl:replace referent="{ $about }">
      <td width="200px" align="right" class="projet">
        <div class="menu_box">{ $about/node() }</div>
      </td>
    </xcl:replace>

    <!--rebuild a new document-->
    <xcl:document name="projet">
      <projet xml="xml" title="{ string( $corps/preceding-sibling::xhtml:H1 ) }">
        { $corps }
      </projet>
    </xcl:document>

    <!--resolving the relative URLs in <A href> and <IMG src>
        against the base URL of the original site-->
    <xcl:for-each name="link" select="{ $projet//xhtml:A[@href] }">
      <xcl:attribute referent="{ $link }" name="href"
          value="{ io:resolve-uri('http://www.inria.fr/recherche/equipes/', string( $link/@href ) ) }"/>
    </xcl:for-each>
    <xcl:for-each name="link" select="{ $projet//xhtml:IMG[@src] }">
      <xcl:attribute referent="{ $link }" name="src"
          value="{ io:resolve-uri('http://www.inria.fr/recherche/equipes/', string( $link/@src ) ) }"/>
    </xcl:for-each>

    <!--selecting the stylesheet-->
    <xcl:set xcl:if="{ $web:match/node()[ 3 ] != 'xml' }"
        name="xslt" value="{ $ralyx.xsl }"/>
    <!--back to the browser-->
    <xcl:transform
        output="{ value( $web:response/@web:output ) }"
        source="{ $projet }"
        stylesheet="{ $xslt }"
    />
  </web:mapping>
</web:service>

the result is a new HTML document that contains an updated part of
another HTML document (this mapping acts almost like a proxy); it is
used in a real application deployed at INRIA
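
For readers who want to see what the match attribute of <web:mapping> in the active sheet above accepts, here is a small sketch in plain Java. It only exercises the regular expression with java.util.regex; it does not use RefleX, and the sample path /2006/Fiches/reflex/reflex.html and the class name are made up for illustration. Note the \2 backreference: the file name must repeat the directory name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappingRegexDemo {
    // Same pattern as the match attribute of <web:mapping>,
    // with backslashes doubled for a Java string literal.
    static final Pattern MAPPING = Pattern.compile(
        "^/(\\d{4})/Fiches/([\\p{Lower}\\d\\-_+]+)/\\2\\.(html|xml)$");

    /** Returns {year, name, extension}, or null when the path does not match. */
    static String[] match(String path) {
        Matcher m = MAPPING.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical request path, chosen only to exercise the pattern.
        String[] groups = match("/2006/Fiches/reflex/reflex.html");
        System.out.println(String.join(", ", groups));
        // The file name differs from the directory name, so \2 fails to match.
        System.out.println(match("/2006/Fiches/reflex/other.html") == null);
    }
}
```

The three captured groups line up with the positions the sheet reads from $web:match (the second for the page name, the third for the extension), although how RefleX exposes them exactly is best checked in its own documentation.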

to use it, simply declare the ReflexServlet in Tomcat:
<web-app>
  <display-name>RefleX application</display-name>
  <description>My RefleX application</description>
  <servlet>
    <servlet-name>ReflexServlet</servlet-name>
    <display-name>RefleX servlet</display-name>
    <description>Runs an Active Sheet</description>
    <servlet-class>org.inria.reflex.ReflexServlet</servlet-class>
    <init-param>
      <param-name>activeSheetPath</param-name>
      <param-value>web:///WEB-INF/active-sheet.xml</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
  </servlet>
  <servlet-mapping><!--custom mappings-->
    <servlet-name>default</servlet-name>
    <url-pattern>*.gif</url-pattern>
  </servlet-mapping>
  <servlet-mapping><!--RefleX mapping-->
    <servlet-name>ReflexServlet</servlet-name>
    <url-pattern>/</url-pattern>
  </servlet-mapping>
</web-app>

when downloading RefleX, check the dependencies and make sure that
NekoHTML 0.9.5 is in the full distribution: for the moment, the latest
available version of RefleX (0.1.2) uses NekoHTML 0.9.4, which has a bug
regarding namespace URIs; this issue is fixed in NekoHTML 0.9.5, which is
available online and will be in RefleX 0.1.3 (coming soon).

Enjoy :)

--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
Mar 23 '06 #2
>> seems the Document type returned by JTidy doesnt support most standard
document methods (including converting XML to string using
doc.getDocumentElement().toString()) leaving me at the same place where
I started from.


Please note that using .toString() to get the XML is *NOT* part of the
W3C DOM spec; it's a feature of one specific DOM implementation.

See DOM Level 3's serialization API, or see the documentation that comes
with a particular DOM for information about what serialization tools it
provides/recommends.
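
As a rough illustration of the DOM Level 3 route, the sketch below parses a small, already well-formed XHTML string with the JDK's own parser, appends a paragraph with appendChild(), and serializes the result through LSSerializer. The class name, method name and sample markup are made up; it assumes the DOM implementation in use supports the "LS" feature (the one bundled with current JDKs does) and it does not cover the HTML clean-up step, which JTidy or NekoHTML would still have to do first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

class Level3SerializeDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Appends <p>text</p> to the body and returns the document as a string. */
    static String appendAndSerialize(String xhtml, String text) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            // Edit the tree with ordinary DOM calls.
            Element body = (Element) doc.getElementsByTagNameNS(XHTML_NS, "body").item(0);
            Element p = doc.createElementNS(XHTML_NS, "p");
            p.setTextContent(text);
            body.appendChild(p);

            // getFeature() returns null when Load and Save is not supported.
            DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
            if (ls == null) {
                throw new UnsupportedOperationException("DOM Level 3 LS not available");
            }
            LSSerializer serializer = ls.createLSSerializer();
            // Leave out the <?xml ...?> declaration in the returned string.
            serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
            return serializer.writeToString(doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not edit/serialize document", e);
        }
    }

    public static void main(String[] args) {
        String in = "<html xmlns=\"" + XHTML_NS + "\"><body><h1>Hi</h1></body></html>";
        System.out.println(appendAndSerialize(in, "added"));
    }
}
```

If getFeature("LS", "3.0") returns null for the Document that a particular tool hands back, that DOM does not offer the Level 3 serializer and one of the tool-specific serializers is the way to go.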
--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 23 '06 #3
BTW, the W3C's Tidy tool can output serialized XHTML/XML directly rather
than as a DOM; is there a reason you're reinventing that wheel?

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 23 '06 #4
Joe Kesselman wrote:
BTW, the W3C's Tidy tool can output serialized XHTML/XML directly rather
than as a DOM; is there a reason you're reinventing that wheel?


Because I want to 'edit' the XHTML returned, by adding a couple of tags
here and there (using DOM methods like appendChild()), and once I have
my desired DOM structure I want to return it as a string.

- Anupam

Mar 23 '06 #5
an********@gmail.com wrote:
Because I want to 'edit' the XHTML returned, by adding a couple of tags
here and there (using DOM methods like appendChild() )and after I get
my desired DOM structure I want to return it as a string.


OK; in that case you either want the DOM Level 3 serializer methods (if
they're supported) or an off-the-shelf DOM serializer... or, possibly,
to write your editing operations as a stylesheet, pass the DOM to an
XSLT processor, and let _its_ serializer deal with the problem.
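
A minimal sketch of the stylesheet variant, using only the JAXP classes shipped with the JDK: the "edit" is an identity transform plus one template that appends a paragraph to body, and the XSLT processor's serializer produces the string. The class name, the stylesheet and the sample input are made up, and the input here has no namespace; for real XHTML in the http://www.w3.org/1999/xhtml namespace the match pattern would need a prefix bound to that namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XsltEditDemo {
    // Identity transform, plus a template that adds one <p> at the end of <body>.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='body'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/><p>added by XSLT</p></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Parses the markup into a DOM, runs the stylesheet over it, returns the result. */
    static String edit(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XSLT expects a namespace-aware DOM
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(edit("<html><body><h1>Hi</h1></body></html>"));
    }
}
```

Calling newTransformer() with no stylesheet gives a plain identity transform, which is a common way to serialize a DOM that was already edited with appendChild().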

--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden
Mar 23 '06 #6

Thanks so much Philippe. I'll try it and get back

Thanks again,
Anupam

Mar 24 '06 #7
I am not able to build NekoHTML properly. After installing everything
it required and moving all the jar files to its lib folder, it gives
me this error when I try to build it:

build -f build-html.xml
Buildfile: build-html.xml

version-init:
[mkdir] Created dir: C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html\src\org\cyberneko\html

version:
[echo] Generating bin/html/src/org/cyberneko/html/Version.java
[echo] Generating bin/html/src/MANIFEST_html

compile:
[javac] Compiling 26 source files to C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\bin\html
[javac] C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\src\html\org\cyberneko\html\HTMLScanner.java:89: org.cyberneko.html.HTMLScanner is not abstract and does not override abstract method getXMLVersion() in org.apache.xerces.xni.XMLLocator
[javac] public class HTMLScanner
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error

BUILD FAILED
C:\Documents and Settings\Anupam Jain\Desktop\nekohtml-0.9.5\build-html.xml:51: Compile failed; see the compiler error output for details.

Total time: 16 seconds

So basically the error is: org.cyberneko.html.HTMLScanner is not
abstract and does not override abstract method getXMLVersion() in
org.apache.xerces.xni.XMLLocator

- Anupam


Mar 24 '06 #8
an********@gmail.com wrote:
I am not able to build nekohtml properly. After installing everything
it required and moving all the jar files to it's lib folder, it gives
me this error when i try to build it :
Hi,

I'm sure you'll get some help by contacting the developer.

However, you could also try using the .jar available in the package
directly.


--
Cordialement,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
Mar 24 '06 #9
Philippe Poulard wrote:
I'm sure you'll get some help by contacting the developper


You may also find some help on Apache's mailing list for Xerces, since
NekoHTML is based on Xerces (and its author hangs out on that list).

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 24 '06 #10
