I'm a real XML novice, but my ultimate goal here is to get a workable schema
for the GEDCOM XML format as spec'ed out here:
http://www.familysearch.org/GEDCOM/GedXML60.pdf
It's a proposed XML format for genealogical records. They include a DTD in
the spec, but sadly it's incomplete: the spec allows "html" (unlimited HTML
or a subset, they don't say) in certain elements for formatting. There is
also a sample GEDCOM file in the spec with some <html:BR /> elements. The
DTD has no declaration for "html:BR", so validation fails on their own
included GEDCOM file.
So I have a few things to do. I want a schema, not a DTD and I want HTML to
be allowed in certain elements. An approximate schema I can produce by
using VS to change the DTD to a schema. The HTML part I have no idea how to
do. I would love to be able to say in the schema that the citation element
(to name one example) can have any tag from the "html" namespace but I don't
see how to do this. So I scaled back my expectations and decided to just
try to allow for <html:BR />. I can't declare this directly in the GEDCOM
schema because the schema processor gets hung up on the namespace prefix. I
couldn't find anything that talked about this in all the schema docs I
looked at, so I decided to let VS handle it for me. I had VS generate a
schema from the GEDCOM XML file and it produced two xsd
files. One is the gedcom.xsd and one is a schema file which solely defines
<html:BR />. It also placed an import element in the gedcom.xsd file:
<xs:import namespace="http//:www.w3c.org/TR/REC-html40/" />
This does, in fact, seem to reference the file with html:BR defined,
although I fail to see how the resolution works. I mean, it seems to work
through the target namespace from the imported schema matching up with the
namespace named in the import statement and that, in turn matching up with
the xmlns attribute in the schema element, but I don't know how it finds the
html file to make these matches. It certainly works for me in VS, and the
gedcom XML file with <html:BR /> seems to validate properly according to the
VS XML editor. They (the base and html schemas) are in the same directory,
which is the directory of my executable so I kind of figured that the
resolution was done by looking at every xsd file in the directory to see if
any have that namespace as a target and importing them if so. In any event,
since the XML editor was telling me that this file was valid with this
schema, I assumed it would also work when I ran my simple program, which
currently just reads the XML file into an XmlDocument. Now, in order to
test the schema in the editor, I have to
explicitly reference the filename with a noNamespaceSchemaLocation attribute
in the root element of the XML file (is there some other way?). If I'm
running this on general GEDCOM files, obviously they're not going to
reference my schema file so I have to remove the explicit filename link
formed by the noNamespaceSchemaLocation attribute and instead, load in the
schema manually and attempt to use it to validate like so:
XmlSchemaSet sc = new XmlSchemaSet();
sc.Add(null, "gedcom.xsd");
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas = sc;
settings.ValidationEventHandler += new ValidationEventHandler(ValidationCallBack);
XmlReader xrd = XmlReader.Create(strXmlFile, settings);
(_xdoc = new XmlDocument()).Load(xrd);
When I do this, it tells me that it can't load the schema file because
html:BR is not declared, so it would appear that my guess about the
resolution occurring because they were in the same directory is wrong, at
least when I'm doing this all programmatically.
So what do I need to do here to get it to do the import correctly? It
appears that importing is the only way to handle the html namespace since I
produced the schema from the GEDCOM file in another piece of software and it
also produced two files with one importing from the other.
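One thing that may explain the failure: an import that names only a namespace gives the schema processor no hint about where the imported schema lives. A possible fix, sketched below under assumptions, is to add a schemaLocation attribute to the import. The file name "html.xsd" is a placeholder for whatever VS named the second schema file, and the namespace URI is copied as-is from the import line quoted above.

```xml
<!-- gedcom.xsd (fragment). "html.xsd" is a hypothetical file name;
     substitute the name of the generated schema that declares html:BR. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:html="http//:www.w3c.org/TR/REC-html40/">

  <!-- schemaLocation is a hint; a relative value is normally resolved
       against the location of the importing schema file. -->
  <xs:import namespace="http//:www.w3c.org/TR/REC-html40/"
             schemaLocation="html.xsd" />

  <!-- ... GEDCOM element declarations, which can then use ref="html:BR" ... -->
</xs:schema>
```

An alternative that leaves the import untouched would be to add the second schema to the same XmlSchemaSet with another Add call before validating, so that both target namespaces are already known when the set is compiled.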
Finally, and this is really a pure XML question: supposing I do get this to
work. That only allows me to use the html namespace in my base schema
declarations. In order to allow for all of HTML in certain elements as I'm
doing things now, that means I have to declare every HTML tag in each of the
elements I want to allow them in. Is that really what has to be done? Do I
have to put *all* potential HTML tags in my schema to declare them as legal?
Again, it would be *really* nice to just be able to say "anything from the
HTML namespace is legal here". Is there some good way of doing that? I
suppose I could just allow for some subset of HTML tags but it would be nice
to give an option where the user could just type in (or paste in) any HTML
he/she likes and I would store it mostly as a black box without ever
directly interpreting any of the contents but just passing them in to an
HTML viewer control.
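XML Schema does have an element wildcard, xs:any, that can be restricted to a single namespace, which looks like a fit for "anything from the HTML namespace is legal here". The fragment below is a minimal sketch, not the real GEDCOM content model: the citation element is shown with nothing but text and HTML children, and the namespace URI is again copied from the import line above (it has to match the xmlns:html value used in the GEDCOM file exactly).

```xml
<!-- Sketch only: the content model for "citation" here is an assumption. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="citation">
    <!-- mixed="true" allows plain text between the HTML elements -->
    <xs:complexType mixed="true">
      <xs:sequence>
        <!-- Any number of elements from the html namespace.
             "lax" validates an element only if a declaration for it
             is available; otherwise it is accepted unchecked. -->
        <xs:any namespace="http//:www.w3c.org/TR/REC-html40/"
                processContents="lax"
                minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Using processContents="skip" instead of "lax" would treat the HTML as a complete black box and only require it to be well-formed, which may be closer to the "pass it to an HTML viewer control" use case.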
Sorry for the long post. I don't know how to make it any more brief. If
you're still with me, thanks for any ideas you might have on this!
Darrell