this code: &#x3, an invalid XML character error.

Hello guys,
I get the "an invalid XML character" error when using Xerces to parse
an XML file. I know that XML requires &, <, >, and " to be escaped as
entities such as "&gt;" and "&lt;". But what if the XML file really
needs to contain text like "&#x3;&#x4;&#x14;&#x8;&#x8;" as the
content of a tag?

The story is:
I am writing a program to parse some XML files produced by another program.
That program grabs webpages and saves each page's URL and
content into an XML file, something like this (for each webpage):

<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>

This works fine, since that program replaces &, <, >, etc. with &amp;,
&lt;, &gt;, etc.

However, some web urls point to files: .zip, .pdf file, etc. The
program just "prints" the .pdf content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
......
(Just think what you will see if you open a .pdf file in notepad!)

As a result, when I use an XML parser (Xerces) to parse the file, I get
errors like:

FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

So, any idea how I can make it work?
How can I tell the Xerces parser to ignore the "&#x..;" references (except
the entities for <, >, ", etc.) and treat them as plain text?

Thanks a lot.
Jul 20 '05 #1
In article <78**************************@posting.google.com >,
Kaidi <ka*******@yahoo.com.sg> wrote:

% I get the "an invalid XML character" error when using Xerces to parse
% an XML file. I know that XML requires &, <, >, and " to be escaped as
% entities such as "&gt;" and "&lt;". But what if the XML file really
% needs to contain text like "&#x3;&#x4;&#x14;&#x8;&#x8;" as the
% content of a tag?

The only valid characters in an XML 1.0 file are tab, carriage-return,
line-feed, and the Unicode code points from U+0020 upward (excluding the
surrogates, U+FFFE, and U+FFFF). Other control characters (such as
&#x3;) are not allowed, even if you enter them as numeric character
references. I suggest encoding binary data using one of
the schemes recognised in MIME, such as quoted-printable (for text with
the odd control character) or base64.

% However, some web urls point to files: .zip, .pdf file, etc. The
% program just "prints" the .pdf content as text and puts it in the XML
% file. In this case, the content of <pagecontent> will look like:

For these, use base64.
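
A minimal sketch of what that could look like on the producing side, assuming Java 8 or later (for java.util.Base64). The class name Base64Content and the helper methods are made up for illustration; the sample bytes are just the first few from the question.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Content {
    // Encode raw bytes so they can sit safely inside an XML element.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recover the original bytes after the XML has been parsed.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text.trim());
    }

    public static void main(String[] args) {
        // "PK" followed by 0x03 0x04 0x14 0x00 0x08 0x00, as in the question.
        byte[] raw = {'P', 'K', 0x03, 0x04, 0x14, 0x00, 0x08, 0x00};
        String encoded = encode(raw);
        System.out.println("<pagecontent>" + encoded + "</pagecontent>");
        // Round trip: the decoded bytes match the input exactly.
        System.out.println(Arrays.equals(raw, decode(encoded)));
    }
}
```

The base64 alphabet contains only letters, digits, +, / and =, so the encoded text needs no further XML escaping.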

--

Patrick TJ McPhee
East York Canada
pt**@interlog.com
Jul 20 '05 #2
Kaidi wrote:
> The program just "prints" the .pdf content as text and puts it in the XML
> file. In this case, the content of <pagecontent> will look like:
>
> PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
> ......
> (Just think what you will see if you open a .pdf file in notepad!)
>
> In this way, when I use a XML parser (xerces) to parse it,

Why do you want to parse PDF with an XML parser? When downloading the
resources, you could store the Content-Type and make XML parsing dependent
on it.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Jul 20 '05 #3
Johannes Koch <ko**@w3development.de> wrote in message news:<2r*************@uni-berlin.de>...
> Kaidi wrote:
>> The program just "prints" the .pdf content as text and puts it in the XML
>> file. In this case, the content of <pagecontent> will look like:
>>
>> PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
>> ......
>> (Just think what you will see if you open a .pdf file in notepad!)
>>
>> In this way, when I use a XML parser (xerces) to parse it,
>
> Why do you want to parse PDF with an XML parser? When downloading the
> resources, you may store the content-type and make XML parsing dependent
> on the content-type.

Yes, if I were writing the whole program, I would do it that way. The
problem is that the existing program (which I cannot change) works
this way: it just puts .jar, .pdf, etc. content into one XML file, and
I need to process that XML file. :-(
Jul 20 '05 #4
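
When the producing program cannot be changed, one possible workaround is to clean the text before it reaches the parser. The following is only a sketch under that assumption, not something proposed in the thread: it drops numeric character references that point at characters XML 1.0 forbids and leaves everything else untouched. The class name CharRefSanitizer and its methods are hypothetical. Note that this is lossy: the file becomes parseable, but the binary content cannot be recovered from it, and raw (unescaped) control characters are not handled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharRefSanitizer {
    // Matches decimal (&#3;) and hexadecimal (&#x3;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // The Char production of XML 1.0.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references to forbidden characters; keeps all other text as-is.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.startsWith("x")
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point, treat as invalid
            }
            String replacement = isXml10Char(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<pagecontent>PK&#x3;&#x4;&#x14;&#x0;&#x8;R&lt; +&#x41;</pagecontent>";
        System.out.println(stripInvalidCharRefs(in));
    }
}
```

The cleaned string could then be handed to the parser, for example wrapped in a java.io.StringReader inside an org.xml.sax.InputSource. For very large files, a streaming filter over a Reader would be a better fit than loading everything into one String.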