this code: &#x3, an invalid XML character error.

Hello guys,
I get the "an invalid XML character" error when using Xerces to parse
an XML file. I know that XML requires &, <, >, and " to be escaped as
entity references such as "&amp;", "&lt;", "&gt;", and "&quot;". But what
if the XML file really needs to contain text like
"&#x3;&#x4;&#x14;&#x8;&#x8;" (as the content of an element)?

The story is:
I am writing a program to parse XML files produced by another program.
That program grabs web pages and saves each page's URL and
content into an XML file, something like this (for each web page):

<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>

This works fine, since that program replaces &, <, >, etc. with &amp;,
&lt;, &gt;, etc.

However, some URLs point to files such as .zip or .pdf. The
program just "prints" the file content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
......
(Just think what you will see if you open a .pdf file in notepad!)

As a result, when I parse the file with an XML parser (Xerces), I get
errors like:

FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

So, any idea how I can make it work?
How can I tell the Xerces parser to ignore the "&#x..;" character
references (except the entities for <, >, ", etc.) and treat them as
plain text?

Thanks a lot.
Jul 20 '05 #1
In article <78**************************@posting.google.com >,
Kaidi <ka*******@yahoo.com.sg> wrote:

% I get the "an invalid XML character" error when using xerces to parse
% a XML file. I know that XML will correspond the &, <, >, " to special
% strings like "&gt;&lt;". However, how about if the XML file really
% needs to contain some text like: "&#x3;&#x4;&#x14;&#x8;&#x8;"? (as
% content of a tag)

The only valid characters in an XML 1.0 file are tab, carriage return,
line feed, and the Unicode code points from U+0020 upward (apart from
surrogates, U+FFFE, and U+FFFF). Even if you enter them as numeric
character references, other control characters (such as &#x3;) are not
allowed. I suggest encoding binary data using one of the schemes
recognised in MIME, such as quoted-printable (for text with the odd
control character) or base64.
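
For reference, here is a minimal Java sketch of the character rule described above, assuming the document is XML 1.0. The class and method names (XmlChars, isXml10Char) are made up for illustration; the ranges follow the Char production of the XML 1.0 specification.

```java
final class XmlChars {
    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)          // everything below the surrogates
            || (cp >= 0xE000 && cp <= 0xFFFD)        // skips surrogates, U+FFFE, U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // supplementary planes
    }

    public static void main(String[] args) {
        // Code points taken from the sample content in the question, plus 'A'.
        int[] samples = {0x0, 0x3, 0x4, 0x14, 0x8, 0x9, 0x41};
        for (int cp : samples) {
            System.out.printf("U+%04X allowed: %b%n", cp, isXml10Char(cp));
        }
    }
}
```

A check like this applies whether the character appears literally or as a numeric character reference, which is why &#x3; is rejected.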

% However, some web urls point to files: .zip, .pdf file, etc. The
% program just "prints" the .pdf content as text and puts it in the XML
% file. In this case, the content of <pagecontent> will look like:

For these, use base64.
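
A rough sketch of what the base64 approach could look like on the producing side, assuming Java 8 or later (java.util.Base64). The class name PageContentCodec and its two helpers are hypothetical; the sample bytes are the start of the content shown in the question.

```java
import java.util.Arrays;
import java.util.Base64;

final class PageContentCodec {
    /** Encodes raw file bytes into text that is safe to place inside an XML element. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Recovers the original bytes from the element text (surrounding whitespace is trimmed). */
    static byte[] decode(String elementText) {
        return Base64.getDecoder().decode(elementText.trim());
    }

    public static void main(String[] args) {
        // "PK" followed by the control bytes from the sample content.
        byte[] raw = {'P', 'K', 0x03, 0x04, 0x14, 0x00, 0x08, 0x00, 0x08, 0x00};
        String text = encode(raw);
        System.out.println("<pagecontent>" + text + "</pagecontent>");
        System.out.println("round trip ok: " + Arrays.equals(raw, decode(text)));
    }
}
```

The base64 alphabet contains only letters, digits, +, /, and =, so the encoded text needs no further XML escaping.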

--

Patrick TJ McPhee
East York Canada
pt**@interlog.com
Jul 20 '05 #2
Kaidi wrote:
> The program just "prints" the .pdf content as text and puts it in the XML
> file. In this case, the content of <pagecontent> will look like:
>
> PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
> ......
> (Just think what you will see if you open a .pdf file in notepad!)
>
> In this way, when I use a XML parser (xerces) to parse it,


Why do you want to parse PDF with an XML parser? When downloading the
resources, you could store the content type and make XML parsing
dependent on it.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Jul 20 '05 #3
Johannes Koch <ko**@w3development.de> wrote in message news:<2r*************@uni-berlin.de>...
> Kaidi wrote:
>> The program just "prints" the .pdf content as text and puts it in the XML
>> file. In this case, the content of <pagecontent> will look like:
>>
>> PK&#x3;&#x4;&#x14;&#x0;&#x8;&#x0;&#x8;&#x0;ÈR&lt; +&#x0;&#x0;&#x0;&#
>> ......
>> (Just think what you will see if you open a .pdf file in notepad!)
>>
>> In this way, when I use a XML parser (xerces) to parse it,
>
> Why do you want to parse PDF with an XML parser? When downloading the
> resources, you may store the content-type and make XML parsing dependent
> on the content-type.


Yes, if I were writing the whole program, I would do it that way. The
problem is that the existing program (which I cannot change) already
works this way: it just puts .jar, .pdf, etc. content into one XML file,
and I need to process that file. :-(
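
When the producing program cannot be changed, one possible workaround is to pre-process the raw text before Xerces sees it, turning each illegal numeric character reference into literal text by escaping its leading ampersand. This is not a Xerces option (a conforming parser has to treat these references as fatal errors), just a sketch of a filter run over the file contents first. The class and method names are hypothetical, and it assumes the references are complete; a truncated reference such as a trailing "&#" would still fail to parse.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefSanitizer {
    // Complete numeric character references: hex (&#x3;) or decimal (&#3;).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    /** XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    private static boolean refersToValidChar(Matcher m) {
        try {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            return isXml10Char(cp);
        } catch (NumberFormatException tooLarge) {
            return false; // does not fit in an int, so it cannot be a valid code point
        }
    }

    /** Rewrites illegal references such as &#x3; to &amp;#x3; and leaves everything else alone. */
    static String escapeInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuilder out = new StringBuilder(xml.length());
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());
            if (refersToValidChar(m)) {
                out.append(m.group());
            } else {
                // Keep the text of the reference, but escape its ampersand.
                out.append("&amp;").append(xml, m.start() + 1, m.end());
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<pagecontent>PK&#x3;&#x4;&#x14;&#x0;&lt;&#x41;</pagecontent>";
        System.out.println(escapeInvalidCharRefs(raw));
    }
}
```

After this filter, the parser reports the element text with the characters "&#x3;" in it rather than failing. The original binary content is still not recoverable in any reliable way, so this only helps if the goal is to get past the binary pages and process the rest of the file.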
Jul 20 '05 #4
