Bytes | Software Development & Data Engineering Community
Need to extract XML or SGML entities from a Unicode text

I'm working on a list of Japanese characters that contains each character,
its Unicode hexadecimal code and the XML/SGML entity used for that
character. A Unicode document is read into the program, the program
removes every duplicate, and the hexadecimal Unicode code is extracted,
but I don't know a way to find the XML or SGML entity equivalent to the
Unicode code. Could anyone give me a pointer?

Best regards

Jul 22 '06 #1


If you have the Unicode code point number then in XML as well as SGML you
can write a numeric character reference, either with the decimal number
&#number;
or with the hexadecimal number
&#xhexnumber;
XML predefines only five entities (quot, apos, lt, gt, amp), so you would
need to keep those in a dictionary or similar.
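
As a rough illustration of the numeric form, the sketch below turns every non-ASCII code point of a string into a hexadecimal character reference. The helper name ToCharacterReferences is made up for this example, and it assumes well-formed UTF-16 input (char.ConvertToUtf32 throws on a lone surrogate). ASCII characters, including & and <, are passed through unchanged, so this is not a complete escaping routine.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the thread): replaces each non-ASCII
// code point in `text` with a reference of the form &#xHHHH;.
static string ToCharacterReferences(string text)
{
    var result = new StringBuilder();
    for (int i = 0; i < text.Length; i++)
    {
        // ConvertToUtf32 combines a surrogate pair into a single code point.
        int codePoint = char.ConvertToUtf32(text, i);
        if (char.IsHighSurrogate(text[i]))
        {
            i++; // skip the low surrogate that was just consumed
        }

        if (codePoint < 0x80)
        {
            result.Append((char)codePoint); // plain ASCII is copied as-is
        }
        else
        {
            result.Append("&#x").Append(codePoint.ToString("X")).Append(';');
        }
    }
    return result.ToString();
}

// U+65E5 U+672C U+8A9E followed by ASCII text.
Console.WriteLine(ToCharacterReferences("\u65E5\u672C\u8A9E abc"));
// prints: &#x65E5;&#x672C;&#x8A9E; abc
```

Working per code point rather than per char matters for characters outside the Basic Multilingual Plane, which .NET strings store as two chars.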

I think SGML itself does not predefine any more entities. If you are
thinking about HTML rather than SGML, HTML 4 defines some entities, which
you can find here:
<http://www.w3.org/TR/html4/sgml/entities.html>
So you need to put those into a dictionary.
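
A minimal sketch of such a dictionary, assuming it is keyed by code point and falls back to a numeric reference when no name is known. Only four entries from the HTML 4 list are shown, and the names entityNames and ToReference are invented for the example. Note that the HTML 4 list covers Latin-1, symbols, Greek and a few special characters, so Japanese characters would take the fallback path.

```csharp
using System;
using System.Collections.Generic;

// A few entries copied from the HTML 4 entity list; a real table
// would contain every entry from that list.
var entityNames = new Dictionary<int, string>
{
    { 0x00A0, "nbsp" },
    { 0x00A9, "copy" },
    { 0x00E4, "auml" },
    { 0x20AC, "euro" },
};

// Hypothetical helper: named entity if the table has one,
// otherwise a hexadecimal character reference.
string ToReference(int codePoint)
{
    return entityNames.TryGetValue(codePoint, out var name)
        ? "&" + name + ";"
        : "&#x" + codePoint.ToString("X") + ";";
}

Console.WriteLine(ToReference(0x00E4)); // prints: &auml;
Console.WriteLine(ToReference(0x65E5)); // prints: &#x65E5;
```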

If you have an SGML or XML DTD that defines character entities and you
want to read them out programmatically, you can do that with an SGML or
XML parser, respectively. .NET has a built-in XML parser that can parse
DTDs, but there is not much of an API to get at the information in the
DTD while parsing is in progress.
After parsing, however, the DOM API gives you some information about
entities.
C# example:

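using System;
using System.Xml;

// Load a document whose internal DTD subset declares one character entity.
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(@"<!DOCTYPE example [
  <!ENTITY auml ""ä"">
]>
<example>Kibology</example>");

// After parsing, the DOM exposes each declared entity by name.
foreach (XmlEntity entity in xmlDocument.DocumentType.Entities) {
    Console.WriteLine("Entity name: {0}, replacement: {1}.",
        entity.Name, entity.InnerText);
}

prints out

Entity name: auml, replacement: ä.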
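
Building on the same DOM calls, one possible way to go from a character back to its entity name is to invert that list into a dictionary. This is only a sketch: the two-entity DTD is a made-up sample, nameByCharacter is an invented name, and it assumes every entity's replacement text is a single plain character.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Sample document with a small internal DTD subset (illustrative only).
var xmlDocument = new XmlDocument();
xmlDocument.LoadXml(@"<!DOCTYPE example [
  <!ENTITY auml ""ä"">
  <!ENTITY ouml ""ö"">
]>
<example/>");

// Map each entity's replacement text back to the entity name.
var nameByCharacter = new Dictionary<string, string>();
foreach (XmlEntity entity in xmlDocument.DocumentType.Entities)
{
    nameByCharacter[entity.InnerText] = entity.Name;
}

Console.WriteLine(nameByCharacter["ä"]); // prints: auml
```

If two entities share the same replacement text, the later one overwrites the earlier one in this sketch.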
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Jul 23 '06 #2
Thanks for the answer,

I know I can build my own entity list, but I would like a list that is
more or less standardized. Let me explain my problem a bit further. The
company I am helping has a database where it stores its publications in
SGML. So far they have only stored languages that are more or less
"simple" and have picked out about 200 entities that have to be
translated to SGML. Now they want to include Japanese, Korean and Chinese
in the database, and suddenly the list will grow to several thousand
entities.

Since Asian languages use several thousand characters, I have written a
program that you can paste Unicode text into (simply copy the text of
several publications into a textbox). The program then creates a list of
each unique character together with its hexadecimal code. So far I've
imported only Japanese and got about 900 unique characters.
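
For readers who want to reproduce that step, here is a small sketch of collecting each distinct code point with its hexadecimal code. It is not the program described above; the helper name UniqueCodePoints and the sample string are assumptions for illustration, and the input is assumed to be well-formed UTF-16.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns every distinct code point of `text`
// exactly once, in ascending order.
static SortedSet<int> UniqueCodePoints(string text)
{
    var codePoints = new SortedSet<int>();
    for (int i = 0; i < text.Length; i++)
    {
        int codePoint = char.ConvertToUtf32(text, i);
        if (char.IsHighSurrogate(text[i]))
        {
            i++; // a surrogate pair occupies two chars
        }
        codePoints.Add(codePoint); // the set ignores duplicates
    }
    return codePoints;
}

// Four characters, one of them (U+672C) repeated.
foreach (int codePoint in UniqueCodePoints("\u65E5\u672C\u306E\u672C"))
{
    // Character, a tab, then its code in U+XXXX form.
    Console.WriteLine("{0}\tU+{1:X4}", char.ConvertFromUtf32(codePoint), codePoint);
}
```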

Now I want their SGML/XML entity equivalents (or even HTML ones).

I've created a shortcut that works but isn't very pretty: I copy the
list of Unicode characters into an HTML parser, copy the "translation"
into another textbox, and another function then connects the HTML
entities to the Unicode characters by cycling through every entity. I
would like to skip this step if possible :)

Best regards,

/Jonas

Aug 1 '06 #3
