Greetings,
I have an application that receives HTML fragments containing
snippets (sub-fragments?) of XML. I wish to extract the XML bits for further
processing. While playing around with various ways to accomplish
this, I came up with the following code:
string xmlFrag =
@"
<P>&nbsp;</P>
<P>This is a test <xml><field name=""instructor-name"">Brian
Cobb</field></xml></P>
<P>&nbsp;</P>
<P>Another test</P><xml><field name=""picture"">brianc.jpg</field></xml>";
try
{
    // Internal DTD subset that declares the one HTML entity the fragment uses.
    string subset = "<!ENTITY nbsp '&#160;'>";
    XmlParserContext context = new XmlParserContext(
        null,                       // name table
        null,                       // namespace manager
        "html",                     // DOCTYPE name
#if USING_DOCTYPE_IDS
        this.publicID,
#else
        null,
#endif
#if USING_DOCTYPE_IDS
        this.systemID,
#else
        null,
#endif
#if USE_SUBSET
        subset,
#else
        null,
#endif
        "",                         // base URI
        "en-us",                    // xml:lang
        XmlSpace.None,
        System.Text.Encoding.UTF8
    );
    XmlValidatingReader reader =
        new XmlValidatingReader(xmlFrag, XmlNodeType.Element, context);
    reader.ValidationType = ValidationType.None;
    while(reader.Read()) { ...
Note that in my most recent test USING_DOCTYPE_IDS is defined, with
this.systemID = @"http://www.w3.org/TR/html4/loose.dtd"
and this.publicID = @"-//W3C//DTD HTML 4.01 Transitional//EN".
The problem I am having is that on the first call to reader.Read() I get an
XmlException with the message "This is an unexpected token. The expected
token is 'TAGEND'" at Line 31, Column 3. Since there aren't 31 lines in
xmlFrag, I would surmise that the problem lies with the external DTD; however,
selecting different values for systemID and publicID produces similar
results. If I #undef USING_DOCTYPE_IDS, the code works as expected.
What I am trying to avoid is having to define my own general entities as I
have done with nbsp in my example. Fiddling with systemID and publicID is
my initial attempt to use standard DTDs to get around this (think of small
child whining "But Mommy, I don't want to define my own DTD"). But, if anyone
has a better idea how to accomplish this, I'm all ears.
Thanks.
PS: if USE_SUBSET is #undef'ed I get XmlException "undefined entity nbsp".