XML parser

vc
Hi,

I'm looking for an XML parser that wouldn't stop if it finds a minor error
in an XML file. I need to parse an HTML file and there are a lot of HTML
pages that, for instance, don't enclose attribute values in quotes.
Or, for instance, most HTML pages don't have a root tag/element (that
could be "html"). Instead, they have a "doctype" tag before, and at the same
level as, "html", and XML parsers report the error "no root tag found".

I have tried 3-4 SAX parsers, but none of them works :-(

It would be great if you can recommend a C++ or Java (preferably SAX 2.0
compliant) XML parser.
Thank you in advance,

vc

Jul 23 '05 #1
vc wrote:
> I'm looking for an XML parser that wouldn't stop if it finds a minor error
> in an XML file. I need to parse an HTML file and there are a lot of HTML
> pages that, for instance, don't enclose attribute values in quotes.


Use tidy -asxhtml to convert it to XHTML. Then use XPath to query into it.

http://tidy.sourceforge.net/

Shell to tidy with system() or _popen() - don't bother to link it.
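
Shelling out to tidy could look roughly like this from Java (one of the two languages the original poster asked about). This is only a sketch, assuming Java 9 or later, a `tidy` executable on the PATH, and UTF-8 pages; the names `TidyRunner`, `runOnFile` and `toXhtml` are made up for illustration. The page is written to a temporary file and passed as an argument, so there is no stdin pipe to deadlock on.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs an external converter on a file and captures its stdout.
class TidyRunner {

    // Runs "<command...> <file>" and returns whatever the child printed to stdout.
    static String runOnFile(List<String> command, Path file)
            throws IOException, InterruptedException {
        List<String> argv = new ArrayList<>(command);
        argv.add(file.toString());
        Process child = new ProcessBuilder(argv)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // diagnostics go to our stderr
                .start();
        child.getOutputStream().close();                  // the child gets no stdin
        byte[] out = child.getInputStream().readAllBytes();
        child.waitFor();                                  // exit status is ignored in this sketch
        return new String(out, StandardCharsets.UTF_8);
    }

    // Assumes a "tidy" binary on the PATH; -asxhtml, -q and -utf8 are tidy options.
    static String toXhtml(String html) throws IOException, InterruptedException {
        Path tmp = Files.createTempFile("page", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            return runOnFile(List.of("tidy", "-asxhtml", "-q", "-utf8"), tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        // Tag soup in, XHTML out (if tidy is installed).
        System.out.println(toXhtml("<p align=center>unquoted attribute, no root"));
    }
}
```

The string returned by `toXhtml` can then be handed to an ordinary SAX parser or an XPath engine. A real program would probably want to look at tidy's exit status as well.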

And note the entire purpose of XML is to be a well-formed data language, not
a forgiving Notepad-oriented markup language. I really doubt you'll find an
XML parser that permits ill-formed input!

--
Phlip
http://www.c2.com/cgi/wiki?ZeekLand
Jul 23 '05 #2
vc wrote:
> Hi,
>
> I'm looking for an XML parser that wouldn't stop if it finds a minor error
> in an XML file. I need to parse an HTML file and there are a lot of HTML
> pages that, for instance, don't enclose attribute values in quotes.
> Or, for instance, most HTML pages don't have a root tag/element (that
> could be "html"). Instead, they have a "doctype" tag before, and at the same
> level as, "html", and XML parsers report the error "no root tag found".
>
> I have tried 3-4 SAX parsers, but none of them works :-(
>
> It would be great if you can recommend a C++ or Java (preferably SAX 2.0
> compliant) XML parser.
> Thank you in advance,
>
> vc


Why don't you use an HTML parser?

Try this one:
http://people.apache.org/~andyc/neko/doc/html/
It's a nice toy.

--
Cordialement,

///
(. .)
-----ooO--(_)--Ooo-----
| Philippe Poulard |
-----------------------
Jul 23 '05 #3
vc wrote:
> Hi,
>
> I'm looking for an XML parser that wouldn't stop if it finds a minor error
> in an XML file.

onsgmls keeps going to the end (or until a configurable number of errors).
It is part of OpenSP, from http://sourceforge.net/projects/openjade/

> I need to parse an HTML file and there are a lot of HTML
> pages that, for instance, don't enclose attribute values in quotes.

But they may be perfectly valid SGML, just not XML. SGML permits lots of
abbreviations that are not allowed in XML.

Or they may just be garbage (more likely :-)
You can run them through HTML Tidy to try to make them XHTML.

> Or, for instance, most HTML pages don't have a root tag/element (that
> could be "html").

That, too, is permitted in some older SGML DTDs for HTML.

> Instead, they have a "doctype" tag before, and at the same
> level as, "html", and XML parsers report the error "no root tag found".

That's a DocType Declaration. It specifies the version of HTML being used
(in theory: in practice it's garbage added by editors which don't know
what they are doing and just throw it in to confuse things).

Again, use HTML Tidy to try to make the file into XHTML.
Then validate with:

$ onsgmls -wxml -s /your/path/to/xml.dec filename.xml

If you use Emacs, this can be configured to happen automatically when you
validate a document, and the error lines get coloured and become links to
the location in the document where the error was spotted.

You will need a copy of the SGML Declaration for XML (xml.dec). The original at
http://www.w3.org/TR/NOTE-sgml-xml-971215 is starting to suffer from
bitrot and W3C neglect, so I have put a working copy online at
http://xml.silmaril.ie/xml.dec_onsgmls (note this is slightly different
from the original, which is available at http://xml.silmaril.ie/xml.dec_jc)
Just rename it to xml.dec on your machine.

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"
Jul 23 '05 #4
vc wrote:
> Hi,
>
> I'm looking for an XML parser that wouldn't stop if it finds a minor error
> in an XML file. I need to parse an HTML file and there are a lot of HTML
> pages that, for instance, don't enclose attribute values in quotes.
> Or, for instance, most HTML pages don't have a root tag/element (that
> could be "html"). Instead, they have a "doctype" tag before, and at the same
> level as, "html", and XML parsers report the error "no root tag found".


People have suggested Tidy, nekohtml and onsgmls. I'd suggest the HTML
parser from libxml2 in preference to those for most purposes.

But you don't necessarily need any such thing. Although XML parsers
are required to stop on encountering a fatal error, many of them can
be set to continue. For example, mod_validator sets Xerces to continue
so it will report all errors in an XML document.
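
As a rough illustration of setting Xerces to continue, here is a sketch that uses the SAX parser bundled with the JDK, which is derived from Xerces. The feature URI is the Xerces "continue-after-fatal-error" feature; the class name `KeepGoingSax`, the method `collectErrors` and the cap of 100 errors are assumptions made for this example. The Xerces documentation describes parser behaviour in this mode as undetermined, so the sketch caps the error count and treats any odd failure as "the parser gave up".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses a document and collects every reported error.
class KeepGoingSax {
    static final String CONTINUE_FEATURE =
            "http://apache.org/xml/features/continue-after-fatal-error";
    static final int MAX_ERRORS = 100; // arbitrary safety cap for this sketch

    static List<String> collectErrors(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            reader.setFeature(CONTINUE_FEATURE, true);
        } catch (SAXException notSupported) {
            // Not a Xerces-derived parser: it will stop at the first fatal error.
        }
        List<String> errors = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private void record(SAXParseException e) throws SAXException {
                errors.add(e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
                if (errors.size() >= MAX_ERRORS) {
                    throw new SAXException("too many errors, giving up");
                }
            }
            @Override public void error(SAXParseException e) throws SAXException { record(e); }
            @Override public void fatalError(SAXParseException e) throws SAXException { record(e); }
        };
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | RuntimeException gaveUp) {
            // The parser stopped (or misbehaved); return what was recorded so far.
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        for (String e : collectErrors("<root><a></b></root>")) {
            System.out.println(e);
        }
    }
}
```

This reports errors rather than repairing them: the element events delivered after a fatal error may not reflect what the page author meant, so for real tag soup an HTML parser or a Tidy pass is still the safer route.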

--
Nick Kew
Jul 23 '05 #5
Nick Kew (ni**@asgard.webthing.com) wrote:
: vc wrote:
: > Hi,
: >
: > I'm looking for an XML parser that wouldn't stop if it finds a minor error
: > in an XML file. I need to parse an HTML file and there are a lot of HTML
: > pages that, for instance, don't enclose attribute values in quotes.
: > Or, for instance, most HTML pages don't have a root tag/element (that
: > could be "html"). Instead, they have a "doctype" tag before, and at the same
: > level as, "html", and XML parsers report the error "no root tag found".

: People have suggested Tidy, nekohtml and onsgmls. I'd suggest the HTML
: parser from libxml2 in preference to those for most purposes.

: But you don't necessarily need any such thing. Although XML parsers
: are required to stop on encountering a fatal error, many of them can
: be set to continue. For example, mod_validator sets Xerces to continue
: so it will report all errors in an XML document.

Another option is Perl.

Module: HTML::Parser
It's the same idea as a SAX parser, but it expects HTML, handles many of the
quirks that are common in real pages, is quite speedy, and comes pre-installed
with many Perl distros.
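
For readers staying in Java, the same callback style can be sketched with the HTML parser that ships in the JDK's Swing packages. To be clear, this is a swapped-in stand-in and not HTML::Parser: `ParserDelegator` and `HTMLEditorKit.ParserCallback` are real JDK classes, but they target fairly old HTML, and the `LinkLister` class and `links` method are names invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example: collect href values from tag soup using start-tag callbacks.
class LinkLister {

    static List<String> links(String html) throws IOException {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        // Third argument: ignore any charset declared inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // No root element and an unquoted attribute value, as in the original question.
        System.out.println(links("<a href=page.html>one</a> <a href=\"two.html\">two</a>"));
    }
}
```

As with a SAX handler, you override only the callbacks you care about (`handleText`, `handleEndTag` and `handleSimpleTag` are also available), and the parser does not stop at missing quotes or a missing root element.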
--

This space not for rent.
Jul 23 '05 #6
