Hello guys,
I get an "invalid XML character" error when using Xerces to parse
an XML file. I know that XML maps &, <, > and " to special
strings like "&gt;&lt;". However, what if the XML file really
needs to contain some text like: ""? (as the
content of a tag)
The story is:
I am writing a program to parse some XML files from another program.
That program grabs webpages and saves the pages' URLs and
content into an XML file, something like (for each webpage):
<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>
This works fine since that program will replace &, <, > etc. with &lt;
etc.
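
To make the replacement step concrete, here is a minimal sketch of that kind of escaping. The other program's real code is not shown in this post, so the class and method names below are hypothetical and the logic is only an assumption about what it does.

```java
// Hypothetical helper: the grabbing program's actual escaping code is not shown.
final class XmlEscapeSketch {

    // Replaces the five XML special characters with their predefined entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;a href=&quot;x?a=1&amp;b=2&quot;&gt;
        System.out.println(escape("<a href=\"x?a=1&b=2\">"));
    }
}
```

Escaping these five characters is enough for HTML text, but it does nothing about control characters, which XML 1.0 does not allow even when written as character references.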
However, some URLs point to binary files (.zip, .pdf, etc.). The
program just "prints" the file content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:
PK���ÈR< +���&#
......
(Just think what you will see if you open a .pdf file in notepad!)
As a result, when I use an XML parser (Xerces) to parse the file, I get
errors like:
FATAL: line 5079: Character reference "" is an invalid XML
character.
org.xml.sax.SAXParseException: Character reference "" is an
invalid XML character.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
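
For anyone who wants to reproduce this without the full data file, the sketch below feeds a tiny document to the JDK's built-in SAX parser. The reference &#1; is an assumption chosen for illustration, since the actual reference from my file did not survive in the messages above; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of either program described in this post.
final class InvalidCharRefDemo {

    // Returns the parser's error message, or null if the document parsed cleanly.
    static String parseError(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // &#1; (assumed example) names a control character that XML 1.0 forbids.
        System.out.println(parseError("<pagecontent>&#1;</pagecontent>"));
        // The predefined entities are fine, so this parses without an error.
        System.out.println(parseError("<pagecontent>&lt;b&gt;</pagecontent>"));
    }
}
```

The first call should report a fatal error of the same kind as in the trace above, while the second returns null.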
So, any idea how I can make it work?
How can I tell the Xerces parser to ignore the "&xx;" sequences (except
those for <, >, ", etc.) and parse them just as plain text?
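
To show what I mean by treating them as plain text, here is a rough sketch of a pre-processing pass that runs before the parser sees the file. It is not a Xerces feature: it is a hypothetical helper (all names are assumptions) that rewrites only numeric character references to code points outside the XML 1.0 legal ranges, turning the leading & into &amp; so the parser reads the reference as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; it is not part of Xerces.
final class CharRefSanitizerSketch {

    // Matches decimal (&#123;) and hexadecimal (&#x7B;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // Code points allowed by the Char production of XML 1.0.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Escapes the '&' of every illegal numeric reference; legal ones are kept.
    static String escapeIllegalRefs(String rawXml) {
        Matcher m = NUMERIC_REF.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body, 10);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, so treat as illegal
            }
            String replacement = isLegalXml10(cp)
                    ? m.group()
                    : "&amp;" + m.group().substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a&amp;#1;b&#65;
        System.out.println(escapeIllegalRefs("a&#1;b&#65;"));
    }
}
```

This only handles written-out references. Raw control bytes in the file, like the binary content shown earlier, would still be rejected, and references inside CDATA sections would be rewritten when they should not be.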
Thanks a lot.