Need to extract XML or SGML entities from a Unicode text

Frantic

I'm building a list of Japanese characters in which each row holds the
character, its hexadecimal Unicode code point, and the XML/SGML entity
used for that character. The program reads a Unicode document, removes
every duplicate character, and extracts the hexadecimal code point of
each one, but I don't know how to find the XML or SGML entity that
corresponds to a given code point. Could anyone give me a pointer?

Best regards

Jul 22 '06 #1

Martin Honnen

If you have the Unicode code point, then in both XML and SGML you can
write a numeric character reference, either in decimal:
&#number;
or in hexadecimal:
&#xhexnumber;
XML predefines only five entities (quot, apos, lt, gt, amp), so you
would need to keep those in a dictionary or similar.
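
As a minimal sketch of that idea (not code from the original program), the snippet below keeps the five predefined XML entities in a dictionary and falls back to a hexadecimal numeric character reference for everything else. The helper name `ToReference` is hypothetical, and the top-level-statement style assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// The five entities that XML predefines, keyed by code point.
var predefined = new Dictionary<int, string>
{
    { 0x22, "quot" }, { 0x26, "amp" }, { 0x27, "apos" },
    { 0x3C, "lt" }, { 0x3E, "gt" }
};

// Hypothetical helper: returns a named reference when one is known,
// otherwise a hexadecimal numeric character reference.
string ToReference(int codePoint)
{
    return predefined.TryGetValue(codePoint, out string name)
        ? "&" + name + ";"
        : "&#x" + codePoint.ToString("X4") + ";";
}

Console.WriteLine(ToReference(0x3042)); // prints &#x3042;
Console.WriteLine(ToReference('&'));    // prints &amp;
```

The same dictionary could later be extended with names from any other entity set, so the lookup code would not need to change.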

I don't think SGML itself predefines any further entities. If you are
thinking of HTML rather than SGML, HTML 4 defines a set of named
entities, listed here:
<http://www.w3.org/TR/html4/sgml/entities.html>
You would need to put those into a dictionary as well.
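
One way to fill such a dictionary without typing it by hand is to parse the entity declarations themselves. The sketch below assumes declarations shaped like those in the HTML 4 entity sets (`<!ENTITY name CDATA "&#number;" -- comment -->`); the two-line sample input and the helper name `ReadEntityNames` are made up for illustration. Note that the HTML 4 sets cover Latin-1, symbols, Greek and a few special characters only, so Japanese characters would still need numeric references.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Two declarations in the style of the HTML 4 entity sets (sample input).
string sampleDtd = @"
<!ENTITY nbsp   CDATA ""&#160;"" -- no-break space -->
<!ENTITY auml   CDATA ""&#228;"" -- latin small letter a with diaeresis -->";

// Hypothetical helper: maps each declared code point to its entity name.
Dictionary<int, string> ReadEntityNames(string dtdText)
{
    var names = new Dictionary<int, string>();
    var pattern = new Regex(@"<!ENTITY\s+(\w+)\s+CDATA\s+""&#(\d+);""");
    foreach (Match m in pattern.Matches(dtdText))
    {
        names[int.Parse(m.Groups[2].Value)] = m.Groups[1].Value;
    }
    return names;
}

var entityNames = ReadEntityNames(sampleDtd);
Console.WriteLine(entityNames[228]); // prints auml
```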

If you have an SGML or XML DTD that defines character entities and you
want to read them out programmatically, you can do so with an SGML or
XML parser, respectively. .NET has a built-in XML parser that can parse
DTDs, but it offers little API for getting at the DTD's information
while parsing is in progress.
After parsing, however, the DOM API gives you some information about
the entities:
C# example:

using System;
using System.Xml;

// Load a document whose internal DTD subset declares one entity.
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.LoadXml(@"<!DOCTYPE example [
<!ENTITY auml ""ä"">
]>
<example>Kibology</example>");
// List every entity the DTD declares.
foreach (XmlEntity entity in xmlDocument.DocumentType.Entities) {
    Console.WriteLine("Entity name: {0}, replacement: {1}.",
        entity.Name, entity.InnerText);
}

prints out

Entity name: auml, replacement: ä.

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/

Jul 23 '06 #2

Frantic

Thanks for the answer,

I know I can build my own entity list, but I would like a list that is
more or less standardized. Let me explain my problem a bit further. The
company I am helping has a database where it stores its publications in
SGML. So far it has only stored languages that are more or less
"simple", and it has managed to pick out about 200 characters that have
to be translated to SGML entities. Now the company wants to include
Japanese, Korean and Chinese in the database, and suddenly the database
will grow to several thousand entities.

Since Asian languages use several thousand characters, I have written a
program that you can paste Unicode text into (simply copy the text of
several publications into a textbox); the program then creates a list
of each unique character together with its hexadecimal code. So far
I've imported only Japanese and got about 900 unique characters.
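
The de-duplication step described above could look roughly like the following. This is a sketch rather than the actual program: the helper name `UniqueCodePoints` and the sample string are invented, it assumes well-formed UTF-16 input (`char.ConvertToUtf32` throws on an unpaired surrogate), and it assumes C# 9 or later for the top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: collects each distinct code point in first-seen order.
List<int> UniqueCodePoints(string text)
{
    var seen = new HashSet<int>();
    var ordered = new List<int>();
    for (int i = 0; i < text.Length; i++)
    {
        int codePoint = char.ConvertToUtf32(text, i);
        if (char.IsHighSurrogate(text[i])) i++; // skip the low surrogate
        if (seen.Add(codePoint)) ordered.Add(codePoint);
    }
    return ordered;
}

// One row per unique character: the character, its code point,
// and the matching hexadecimal numeric character reference.
foreach (int cp in UniqueCodePoints("こんにちは、こんばんは"))
{
    Console.WriteLine("{0}\tU+{1:X4}\t&#x{1:X};", char.ConvertFromUtf32(cp), cp);
}
```

Working on code points rather than on `char` values keeps characters outside the Basic Multilingual Plane, which are stored as surrogate pairs, from being split into two rows.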

Now I want their SGML/XML-entity equivalents (or even HTML).

I've created a workaround that works but isn't very pretty: I copy the
list of Unicode characters into an HTML parser, copy the "translation"
into another textbox, and another function then matches the HTML
entities to the Unicode characters by cycling through every entity. I
would like to skip this step if possible :)

Best regards,

/Jonas

Aug 1 '06 #3
