Need to extract XML or SGML entities from a Unicode text

I'm working on a list of Japanese characters in which each entry holds the
character, its Unicode hexadecimal code, and the XML/SGML entity used for
that character. A Unicode document is read into the program, the program
removes every duplicate, and the hexadecimal Unicode code is extracted,
but I don't know a way to find the XML or SGML entity equivalent to the
Unicode code. Could anyone give me a pointer?

Best regards

Jul 22 '06 #1

Frantic wrote:
I'm working on a list of Japanese characters in which each entry holds the
character, its Unicode hexadecimal code, and the XML/SGML entity used for
that character. A Unicode document is read into the program, the program
removes every duplicate, and the hexadecimal Unicode code is extracted,
but I don't know a way to find the XML or SGML entity equivalent to the
Unicode code. Could anyone give me a pointer?

If you have the Unicode code point number, then in XML as well as SGML you
can write
&#number;
or
&#xhexnumber;
XML predefines only five entities (quot, apos, lt, gt, amp), so you would
need to keep those in a dictionary or similar.
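
The two numeric forms above can be produced directly from a code point. The following is a minimal sketch; the class and method names (CharacterReference, ToDecimal, ToHex) are made up for illustration and are not part of any library.

```csharp
using System;

static class CharacterReference
{
    // Decimal form: &#number;
    public static string ToDecimal(int codePoint)
    {
        return "&#" + codePoint + ";";
    }

    // Hexadecimal form: &#xhexnumber;
    public static string ToHex(int codePoint)
    {
        return "&#x" + codePoint.ToString("X") + ";";
    }
}
```

For example, for U+65E5 the hexadecimal form is `&#x65E5;` and the decimal form is `&#26085;`.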

SGML itself does not predefine any further entities, I think. If you are
thinking of HTML rather than SGML, HTML 4 defines some entities, which
you can find here:
<http://www.w3.org/TR/html4/sgml/entities.html>
So you would need to put those into a dictionary.
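
A dictionary of that kind might look like the sketch below. Only a handful of entries from the HTML 4 tables are filled in, and the names EntityNames and ToReference are hypothetical. Characters without a named entity fall back to a hexadecimal numeric reference; as far as I know the HTML 4 tables contain no kana or kanji, so Japanese text would always take the fallback path.

```csharp
using System;
using System.Collections.Generic;

static class EntityNames
{
    // A few sample entries from the HTML 4 entity tables (code point -> name).
    // A real table would contain every declared entity.
    static readonly Dictionary<int, string> Html4Names = new Dictionary<int, string>
    {
        { 0x00A0, "nbsp" },
        { 0x00A5, "yen" },
        { 0x00A9, "copy" },
        { 0x00E4, "auml" },
        { 0x20AC, "euro" }
    };

    // Returns the named entity if one is known, otherwise a numeric reference.
    public static string ToReference(int codePoint)
    {
        string name;
        if (Html4Names.TryGetValue(codePoint, out name))
        {
            return "&" + name + ";";
        }
        return "&#x" + codePoint.ToString("X4") + ";";
    }
}
```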

If you have an SGML or XML DTD that defines character entities and you want
to read them out programmatically, you can do so with an SGML or XML
parser, respectively. .NET has a built-in XML parser that can parse
DTDs, but there is not much of an API to get at the information in the
DTD while parsing is in progress.
After parsing, however, the DOM API gives you some information about
entities:
C# example:

using System;
using System.Xml;

XmlDocument xmlDocument = new XmlDocument();
// The internal DTD subset declares a single character entity, auml.
xmlDocument.LoadXml(@"<!DOCTYPE example [
<!ENTITY auml ""ä"">
]>
<example>Kibology</example>");
// DocumentType.Entities lists the entities declared in the DTD.
foreach (XmlEntity entity in xmlDocument.DocumentType.Entities) {
    Console.WriteLine("Entity name: {0}, replacement: {1}.",
        entity.Name, entity.InnerText);
}

prints out

Entity name: auml, replacement: ä.
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Jul 23 '06 #2
Thanks for the answer,

I know I can build my own entity list, but I would like to have a list
that is more or less standardized. Let me explain my problem a bit
further. The company I am helping has a database where they store their
publications in SGML. So far they have only stored languages that are
more or less "simple" and have managed to pick out about 200 entities
that have to be translated to SGML. Now they want to add Japanese,
Korean and Chinese to the database, and suddenly the database will
grow to several thousand entities.

Since Asian languages use several thousand characters, I have
written a program that you can paste Unicode text into (simply copy
the text of several publications into a textbox), and the program
then creates a list of each unique character together with its
hexadecimal code. So far I've imported only Japanese and got about 900
unique characters.
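
The collection step described above could be sketched roughly as follows. This is not the actual program; the names UniqueCharacters and Collect are invented for the example, and it assumes the pasted text is well-formed UTF-16 (char.ConvertToUtf32 throws on an unpaired surrogate).

```csharp
using System;
using System.Collections.Generic;

static class UniqueCharacters
{
    // Maps each distinct code point in the text to the character itself,
    // sorted by code point. Duplicates simply overwrite the same key.
    public static SortedDictionary<int, string> Collect(string text)
    {
        var result = new SortedDictionary<int, string>();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            // Characters outside the BMP occupy two UTF-16 code units.
            if (char.IsSurrogatePair(text, i))
            {
                i++;
            }
            result[codePoint] = char.ConvertFromUtf32(codePoint);
        }
        return result;
    }
}
```

Each entry can then be printed with its hexadecimal code, for instance with `Console.WriteLine("{0}\tU+{1:X4}", entry.Value, entry.Key)` inside a foreach over the result.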

Now I want their SGML/XML entity equivalents (or even HTML).

I've created a workaround that works but isn't very elegant: I copy the
list of Unicode characters into an HTML parser, then copy the
"translation" into another textbox, and another function matches the
HTML entities to the Unicode characters by cycling through every entity.
I would like to skip this step if possible :)
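
One possible way to avoid the round trip through an HTML parser is to build the lookup table straight from entity declarations. The sketch below assumes the declarations are available as text in the HTML 4 style, `<!ENTITY name CDATA "&#number;">`, as used in the entity tables linked earlier in the thread. Entity sets that use another form (SDATA, for example) would need a different pattern. EntityTable and Parse are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EntityTable
{
    // Matches declarations such as: <!ENTITY auml CDATA "&#228;" -- comment -->
    static readonly Regex Declaration = new Regex(
        @"<!ENTITY\s+(\w+)\s+CDATA\s+""&#(\d+);""");

    // Builds a code point -> entity name map from the declaration text.
    public static Dictionary<int, string> Parse(string declarations)
    {
        var map = new Dictionary<int, string>();
        foreach (Match match in Declaration.Matches(declarations))
        {
            string name = match.Groups[1].Value;
            int codePoint = int.Parse(match.Groups[2].Value);
            // Keep the first name seen if a code point is declared twice.
            if (!map.ContainsKey(codePoint))
            {
                map[codePoint] = name;
            }
        }
        return map;
    }
}
```

With such a map, each character in the list can be looked up by code point, and anything missing from the map can be written as a numeric reference instead.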

Best regards,

/Jonas

Aug 1 '06 #3
