resolving an entity

Dean A. Hoover

I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

When I run the following against a chunk of xml
containing &copy;, I get the following:

org.xml.sax.SAXParseException: Reference to undefined entity "&copy;".
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3176)
    at org.apache.crimson.parser.Parser2.expandEntityInContent(Parser2.java:2513)
    at org.apache.crimson.parser.Parser2.maybeReferenceInContent(Parser2.java:2422)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1833)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1779)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1507)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:500)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:281)
    at Article.main(Article.java:18)

What can I do to catch these references in my code and output replacement
text for them?

Thanks.
Dean Hoover

Here are the two Java files:
---
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Article
{
    public static void main(String[] argv)
    {
        String file = argv[0];
        PrintWriter pw = new PrintWriter(System.out);
        DefaultHandler handler = new LoadXML(pw, LoadXML.TYPE_HTML);
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try
        {
            SAXParser reader = factory.newSAXParser();
            reader.parse(new File(file), handler);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            return;
        }

        pw.flush();
    }
}
---
import java.io.*;
import java.util.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class LoadXML extends DefaultHandler
{
    public static final int TYPE_HTML = 1;
    public static final int TYPE_TEXT = 2;

    public LoadXML
    (
        java.io.Writer writer,
        int type
    )
    {
        elements_ = new Stack();
        writer_ = writer;
        type_ = type;
    }

    public InputSource resolveEntity
    (
        String publicId,
        String systemId
    ) throws SAXException
    {
        String s = "stuff";
        return new InputSource(new CharArrayReader(s.toCharArray()));
    }

    public void startDocument() throws SAXException
    {
    }

    public void endDocument() throws SAXException
    {
    }

    public void startElement
    (
        String uri,
        String localName,
        String qName,
        Attributes attributes
    ) throws SAXException
    {
        String elementName = qName;
        elements_.push(elementName);

        try
        {
            if (elementName.equals("p"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("title"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("by"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
            else if (elementName.equals("copyright"))
            {
                if (type_ == TYPE_HTML)
                    writer_.write("");
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    public void endElement
    (
        String uri,
        String localName,
        String qName
    ) throws SAXException
    {
        String elementName = qName;
        elements_.pop();

        try
        {
            if (type_ == TYPE_HTML)
            {
                if (elementName.equals("p") || elementName.equals("title") ||
                    elementName.equals("by") || elementName.equals("copyright"))
                {
                    writer_.write("\n");
                }
                else if (elementName.equals("br"))
                {
                    writer_.write(" \n");
                }
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    public void characters
    (
        char[] ch,
        int start,
        int length
    ) throws SAXException
    {
        try
        {
            String content = new String(ch, start, length);
            String top = (String)elements_.peek();
            String text =
                content.replaceAll("\n", " ").replaceAll(" +", " ").trim();

            if (text.length() == 0)
                return;

            if (type_ == TYPE_HTML)
            {
                if (top.equals("p") || top.equals("title") ||
                    top.equals("by") || top.equals("copyright"))
                    writer_.write(text);
            }
        }
        catch (IOException e)
        {
            throw new SAXException(e);
        }
    }

    private Stack elements_;
    private java.io.Writer writer_;
    private int type_;
}

Jul 20 '05 #1


Maarten Wiltink

"Dean A. Hoover" <dh*******@yahoo.com> wrote in message
news:4q********************@twister.nyroc.rr.com...

I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

As I understand it, that's quite impossible. The case is defined
in the spec, and without a DTD you don't get to choose what
entities are defined or not.

But DTD may not mean what you think it does. Would it be permissible
for this document to have an internal DTD subset?

<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
<root>&copy;</root>

A quick reading of the XML spec suggests (but I may have missed
something) that this is a correct construction in XML.
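
To illustrate the suggestion above, here is a minimal sketch that feeds a document with an internal subset to the default JAXP SAX parser and collects the character data it reports. The class name `InternalSubsetDemo` and the helper `textOf` are made up for this example and are not part of the code posted earlier; the sketch assumes the parser in use accepts DOCTYPE declarations, which is the default for a plain `SAXParserFactory`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
class InternalSubsetDemo {

    /** Parses the given XML and returns all character data the parser reports. */
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Entity replacement text arrives here like ordinary text.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The internal subset declares "copy", so the reference is defined.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>\n"
            + "<root>&copy;</root>";
        System.out.println(textOf(xml));
    }
}
```

With the declaration in place the parser expands the reference itself, so the handler only ever sees the replacement text and never the entity name.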

Groetjes,
Maarten Wiltink

Jul 20 '05 #2

Dean A. Hoover

Maarten Wiltink wrote:

"Dean A. Hoover" <dh*******@yahoo.com> wrote in message
news:4q********************@twister.nyroc.rr.com...
I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

As I understand it, that's quite impossible. The case is defined
in the spec, and without a DTD you don't get to choose what
entities are defined or not.

But DTD may not mean what you think it does. Would it be permissible
for this document to have an internal DTD subset?

<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
<root>&copy;</root>

A quick reading of the XML spec suggests (but I may have missed
something) that this is a correct construction in XML.

I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

Dean

Jul 20 '05 #3

Martin Honnen

Dean A. Hoover wrote:

Maarten Wiltink wrote:
"Dean A. Hoover" <dh*******@yahoo.com> wrote in message
news:4q********************@twister.nyroc.rr.com...
I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

As I understand it, that's quite impossible. The case is defined
in the spec, and without a DTD you don't get to choose what
entities are defined or not.

But DTD may not mean what you think it does. Would it be permissible
for this document to have an internal DTD subset?

<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY copy 'copy'> ]>
<root>&copy;</root>

A quick reading of the XML spec suggests (but I may have missed
something) that this is a correct construction in XML.

I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

Well, if you write your own parser then you can of course parse
something like XML but with references to undefined entities. But then
don't attempt to parse it with an XML parser which expects entities to
be defined.

--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #4

Maarten Wiltink

"Dean A. Hoover" <dh*******@yahoo.com> wrote in message
news:uL********************@twister.nyroc.rr.com...

Maarten Wiltink wrote:
"Dean A. Hoover" <dh*******@yahoo.com> wrote in message
news:4q********************@twister.nyroc.rr.com...
I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.
[...] I really don't want any DTD in the document at all. I am writing
some code that will parse an xml document and output either html
or plain text depending on a parameter. In the case of HTML it
would output "&copy;", in the case of plain text it would output
"(c)". I have other similar context based entities to handle as
well.

That's reasonable, but entities simply aren't the solution.
Would using processing instructions instead be acceptable?

In XSLT, you could even source in the transformation itself
with document('') and switch treatment of <?copy?> based on
the output method.

I'm working under the assumption that you want the source to
be well-formed XML, valid if possible.
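
On the SAX side, the processing-instruction idea could look roughly like the sketch below. It assumes the source marks the symbol as `<?copy?>`; the class `PiDemo`, the helper `render`, and the sample text are invented for illustration, and the two constants simply mirror `TYPE_HTML` and `TYPE_TEXT` from the posted `LoadXML`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of the original post.
class PiDemo extends DefaultHandler {
    static final int TYPE_HTML = 1;
    static final int TYPE_TEXT = 2;

    private final int type;
    private final StringBuilder out = new StringBuilder();

    PiDemo(int type) {
        this.type = type;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override
    public void processingInstruction(String target, String data) {
        // The replacement depends on the output type, which is
        // something a fixed entity declaration cannot express.
        if ("copy".equals(target)) {
            out.append(type == TYPE_HTML ? "&copy;" : "(c)");
        }
    }

    /** Parses xml and returns the rendered text for the given output type. */
    static String render(String xml, int type) {
        PiDemo handler = new PiDemo(type);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        String xml = "<copyright><?copy?> 2004 Example Author</copyright>";
        System.out.println(render(xml, TYPE_HTML));
        System.out.println(render(xml, TYPE_TEXT));
    }
}
```

The document stays well-formed without any DTD, and adding another context-dependent symbol only means handling one more target name in `processingInstruction`.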

Groetjes,
Maarten Wiltink

Jul 20 '05 #5

Richard Tobin

In article <4q********************@twister.nyroc.rr.com>,
Dean A. Hoover <dh*******@yahoo.com> wrote:

I am writing a parser for xml that will not have
an associated DTD. I want to be able to handle
certain character references (e.g., &copy;) in
the program.

Well, this is not *real* XML.

The simplest thing to do would be to read the file into a string and
prepend an internal subset that declares the entities in question.
This will be easy if you know that there isn't an XML declaration or
DOCTYPE declaration in the file and you know the file's encoding.
Otherwise it will be more tedious.
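
A rough sketch of the easy case described above: the content is already in a string and has no XML declaration or DOCTYPE of its own. The names `PrependSubsetDemo`, `withEntities`, and `textOf` are made up for this example, and so is the DOCTYPE name `doc`; a non-validating parser should not require it to match the root element, but that is an assumption worth checking against the parser in use.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original post.
class PrependSubsetDemo {

    /**
     * Prepends a DOCTYPE whose internal subset declares each entity.
     * Assumes xml has no XML declaration or DOCTYPE, and that the
     * replacement values contain no '"', '%' or '&' characters.
     */
    static String withEntities(String xml, Map<String, String> entities) {
        StringBuilder doc = new StringBuilder("<!DOCTYPE doc [");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            doc.append("<!ENTITY ").append(e.getKey())
               .append(" \"").append(e.getValue()).append("\">");
        }
        return doc.append("]>").append(xml).toString();
    }

    /** Parses xml and returns all character data the parser reports. */
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String body = "<copyright>&copy; 2004</copyright>";
        // Plain-text mode: map the entity to "(c)".
        Map<String, String> plain = Collections.singletonMap("copy", "(c)");
        System.out.println(textOf(withEntities(body, plain)));
    }
}
```

Because the map is built by the program, the same source can be parsed once with plain-text replacements and once with HTML ones, which is the context-dependent behaviour asked for earlier in the thread.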

-- Richard
--
Spam filter: to mail me from a .com/.net site, put my surname in the headers.

FreeBSD rules!

Jul 20 '05 #6
