Need to extract XML or SGML entities from a Unicode text

I'm working on a list of Japanese characters that records, for each one,
the character itself, its hexadecimal Unicode code point, and the XML/SGML
entity used for it. The program reads a Unicode document, removes every
duplicate character, and extracts the hexadecimal Unicode code point, but
I don't know how to find the XML or SGML entity that corresponds to a
given code point. Could anyone give me a pointer?

Best regards

Jul 22 '06 #1

Frantic wrote:
I'm working on a list of Japanese characters that records, for each one,
the character itself, its hexadecimal Unicode code point, and the XML/SGML
entity used for it. The program reads a Unicode document, removes every
duplicate character, and extracts the hexadecimal Unicode code point, but
I don't know how to find the XML or SGML entity that corresponds to a
given code point. Could anyone give me a pointer?

If you have the Unicode code point, then in XML as well as SGML you can
write a numeric character reference, either in decimal
&#number;
or in hexadecimal
&#xhexnumber;
XML predefines only five named entities (quot, apos, lt, gt, amp), so
you would need to keep those in a dictionary or similar.
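
As a minimal sketch of the numeric form described above, the following C# helper turns a code point into a hexadecimal character reference and escapes every non-ASCII character in a string. The class and method names (NumericReferences, ToHexReference, Escape) are hypothetical and chosen only for illustration; the sketch also assumes well-formed input, since char.ConvertToUtf32 throws on an unpaired surrogate.

```csharp
using System;
using System.Text;

public static class NumericReferences
{
    // Formats one Unicode code point as a hexadecimal character reference.
    public static string ToHexReference(int codePoint)
    {
        return "&#x" + codePoint.ToString("X") + ";";
    }

    // Replaces every non-ASCII character with a hexadecimal reference.
    // ASCII characters, including markup characters such as & and <,
    // are copied through unchanged.
    public static string Escape(string text)
    {
        var result = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            // ConvertToUtf32 combines a surrogate pair into one code point.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                i++; // skip the low surrogate that was just consumed
            }

            if (codePoint < 128)
            {
                result.Append((char)codePoint);
            }
            else
            {
                result.Append(ToHexReference(codePoint));
            }
        }
        return result.ToString();
    }
}
```

For example, NumericReferences.Escape("aあ") yields a&#x3042; because あ is U+3042.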

SGML itself does not predefine more entities, I think. If you are
thinking about HTML rather than SGML, then HTML 4 defines some entities,
which you can find here:
<http://www.w3.org/TR/html4/sgml/entities.html>
So you need to put those into a dictionary.
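
The dictionary idea could look roughly like the sketch below: look the code point up in a table of named entities and fall back to a numeric character reference when no name is known. The class name EntityLookup and the method ToReference are hypothetical, and the table holds only a handful of sample entries from the HTML 4 list, not the full set.

```csharp
using System.Collections.Generic;

public static class EntityLookup
{
    // Sample entries only; a real table would hold every entity
    // defined by the DTD or entity set in use.
    private static readonly Dictionary<int, string> Names =
        new Dictionary<int, string>
        {
            { 34, "quot" }, { 38, "amp" }, { 60, "lt" }, { 62, "gt" },
            { 160, "nbsp" }, { 228, "auml" }, { 8364, "euro" }
        };

    // Returns a named entity reference when one is known,
    // otherwise a hexadecimal character reference.
    public static string ToReference(int codePoint)
    {
        string name;
        if (Names.TryGetValue(codePoint, out name))
        {
            return "&" + name + ";";
        }
        return "&#x" + codePoint.ToString("X") + ";";
    }
}
```

With this table, code point 228 maps to &auml; while a character without a name, such as U+3042, comes back as &#x3042;.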

If you have an SGML or XML DTD that defines character entities and you
want to read them out programmatically, you can do that with an SGML or
XML parser, respectively. .NET has a built-in XML parser that can parse
DTDs, but there is not much of an API to get at the information in the
DTD while the parsing is being done.
After parsing, however, the DOM API gives you some information about
the entities.
C# example:

using System;
using System.Xml;

// Load a document whose internal DTD subset declares one entity.
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(@"<!DOCTYPE example [
<!ENTITY auml ""ä"">
]>
<example>Kibology</example>");

// List every entity declared in the DTD with its replacement text.
foreach (XmlEntity entity in xmlDocument.DocumentType.Entities) {
    Console.WriteLine("Entity name: {0}, replacement: {1}.",
        entity.Name, entity.InnerText);
}

prints out

Entity name: auml, replacement: ä.
--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Jul 23 '06 #2
Thanks for the answer,

I know I can build my own entity list, but I would like to have a list
that is more or less standardized. Let me explain my problem a bit
further. The company I am helping has a database where it stores its
publications in SGML. So far they have stored only languages that are
more or less "simple" and have managed to pick out about 200 entities
that have to be translated to SGML. Now they want to add Japanese,
Korean and Chinese to the database, and suddenly the database will
grow to several thousand entities.

Since Asian languages use several thousand characters, I have written
a program that you can paste Unicode text into (simply copy the text of
several publications into a textbox). The program then creates a list
of each unique character together with its hexadecimal code. So far
I've imported only Japanese and got about 900 unique characters.
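
A character inventory of the kind described above might be sketched in C# as follows. This is not the poster's program; CharacterInventory, UniqueCodePoints and Describe are hypothetical names, and the sketch assumes the pasted text contains no unpaired surrogates.

```csharp
using System;
using System.Collections.Generic;

public static class CharacterInventory
{
    // Collects each distinct code point once, in ascending order.
    public static SortedSet<int> UniqueCodePoints(string text)
    {
        var seen = new SortedSet<int>();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                i++; // the pair was read as one code point
            }
            seen.Add(codePoint);
        }
        return seen;
    }

    // Produces one "character<TAB>hex code" line per unique character.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (int codePoint in UniqueCodePoints(text))
        {
            lines.Add(char.ConvertFromUtf32(codePoint) + "\t"
                + codePoint.ToString("X4"));
        }
        return lines;
    }
}
```

Working on code points rather than on individual char values matters for CJK text, because characters outside the Basic Multilingual Plane occupy two char values in a .NET string.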

Now I want their SGML/XML-entity equivalents (or even HTML).

I've created a shortcut that works but isn't very pretty: I copy the
list of Unicode characters into an HTML parser, copy the "translation"
into another textbox, and then another function connects the HTML
entities to the Unicode characters by cycling through every entity. I
would like to skip this step if possible :)
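
One possible way to avoid the round trip through an HTML parser is to read a published entity set directly. The sketch below assumes the declarations use the form found in the HTML 4 entity files, for example <!ENTITY nbsp CDATA "&#160;" -- no-break space -->, and builds a code point to name table from them. EntityDeclarations and Parse are hypothetical names, and entity sets written in another notation would need a different pattern. Note that the HTML 4 list covers only Latin-1, symbol and special characters, so Japanese, Korean and Chinese characters would still end up as numeric character references.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityDeclarations
{
    // Matches declarations such as: <!ENTITY auml CDATA "&#228;" -- ... -->
    private static readonly Regex Declaration =
        new Regex(@"<!ENTITY\s+(\w+)\s+CDATA\s+""&#(\d+);""");

    // Builds a table from decimal code point to entity name.
    public static Dictionary<int, string> Parse(string declarations)
    {
        var map = new Dictionary<int, string>();
        foreach (Match match in Declaration.Matches(declarations))
        {
            int codePoint = int.Parse(match.Groups[2].Value);
            // Keep the first name seen if a code point is declared twice.
            if (!map.ContainsKey(codePoint))
            {
                map[codePoint] = match.Groups[1].Value;
            }
        }
        return map;
    }
}
```

The text passed to Parse could come from a locally saved copy of the entity files, read with File.ReadAllText.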

Best regards,

/Jonas

Aug 1 '06 #3
