Extracting Unicode characters from RTF

Hi All,
I have come across a difficult problem to do with extracting Unicode characters from RTF strings.
A detailed description of my problem is below. If anyone could help, it would be much appreciated. I've tried to make the problem as clear as possible, but if any clarification is needed, please let me know.

Task
-Convert RTF2 formatted text containing foreign characters (Unicode) to PlainText.

Background

-We are using Stephen Lebans' RTF2 control to display and edit text.
-RTF2 fields cannot be displayed appropriately on reports, so unformatted text must be stored in the database.
-The RTF2 parser cannot handle Unicode (our overseas clients, specifically in Romania, use Unicode characters), so the rtf2.PlainText method often returns strings containing ???
-I have built a simple parser to convert Hex values in rtf2.RTFText to characters
-Given a character table, I can add functionality to generate characters appropriately depending on RTF Character Set defined in .RTFText.

Question
-Where can I find a character table for the Character Sets specified in .RTFText (specifically fcharset238)?

Technical/Testing info:
Fonts
These are the 2 relevant fonts:
F1: {\f1\fnil\fcharset0 MS Sans Serif;}
F2: {\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}

*Testing in MS Word showed that the actual font (Sans Serif, Arial, etc.) made no difference to the presented character, so fcharset is most likely the issue.

Keys
-Pressing ";" usually generates "ş" (hereafter referred to as "s")
-However, when in VB6 code window it generates "º" (this probably isn't important).
-Copy/pasting from/into VB6 code window alternates between the characters.

RTF
-In RTF format, abnormal characters are partly referenced by "\'XX", with XX being their hex values. E.g. the RTF string "xxx\'BAxxx" corresponds to "xxxşxxx".
-In RTF format, abnormal characters are also partly referenced by the specified font.

-So, the actual character displayed is dependent on the hex value, as well as the font (character set) specified in RTF.
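The rule described above (the hex value plus the character set of the current font) can be sketched outside VB6 as follows. This is only an illustrative Python sketch, not how the RTF2 control works: the function names `font_codecs` and `decode_hex_escapes` are hypothetical, the charset-to-code-page mapping covers only the two fonts listed earlier (fcharset0 assumed to be Windows-1252, fcharset238 assumed to be Windows-1250), and it ignores RTF group scoping with braces.

```python
import re

# Assumed mapping from RTF \fcharsetN values to Windows code pages.
# Only the two charsets that appear in the font table above are listed.
CHARSET_TO_CODEC = {0: "cp1252", 238: "cp1250"}

# Font definition: font number followed (within the same group) by its charset.
FONT_DEF = re.compile(r"\\f(\d+)[^{}]*?\\fcharset(\d+)")
# Body tokens of interest: a font switch, or a two-digit hex escape.
TOKEN = re.compile(r"\\f(\d+) ?|\\'([0-9a-fA-F]{2})")


def font_codecs(font_table):
    """Map each font number in an RTF font table to a Python codec name."""
    return {
        int(font): CHARSET_TO_CODEC.get(int(charset), "cp1252")
        for font, charset in FONT_DEF.findall(font_table)
    }


def decode_hex_escapes(body, codecs, default="cp1252"):
    """Decode hex escapes using the code page of the most recently selected font."""
    current = default
    out = []
    pos = 0
    for match in TOKEN.finditer(body):
        out.append(body[pos:match.start()])  # keep plain text between tokens
        pos = match.end()
        if match.group(1) is not None:
            # Font switch: remember which code page now applies.
            current = codecs.get(int(match.group(1)), default)
        else:
            # Hex escape: interpret the byte in the current code page.
            byte = bytes([int(match.group(2), 16)])
            out.append(byte.decode(current, errors="replace"))
    out.append(body[pos:])
    return "".join(out)


fonts = font_codecs(
    r"{\f1\fnil\fcharset0 MS Sans Serif;}"
    r"{\f2\fswiss\fcharset238{\*\fname Arial;}Arial CE;}"
)
print(decode_hex_escapes(r"\f2 xxx\'BAxxx", fonts))
print(decode_hex_escapes(r"\f1 xxx\'BAxxx", fonts))
```

A real parser would also need to handle other control words, nested groups, and \uN Unicode escapes, which this sketch deliberately leaves out.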

Characters
Below is a table indicating my observations for a character. Hex Value and Font are the inputs.

Hex Value || Font || Character Displayed || Unicode for Character Displayed
BA || F1 || ş || 00BA
BA || F2 || º || 015F
Mar 4 '08 #1
I should have mentioned, testing was carried out with Input Language set to Romanian.
Greg
Mar 4 '08 #2
NeoPa
32,556 Expert Mod 16PB
Greg,

I commend you on the care taken to specify the question as well as the trouble you've obviously already gone to to find a solution yourself.

I'm afraid I can't help you directly with this issue, but I will flag it for some of the other Access experts to come and have a look-see in case any of them can help. It is more of a problem come across while using Access than an Access problem per se though, so if we can find no joy in here it may be worth throwing up a link to this thread in the Windows forum too.

Let's see what flagging to the other Access experts can do for us first though.
Mar 5 '08 #3
Scott Price
1,384 Expert 1GB
The ChrW() function will return/display the character associated with the hex value of any Unicode character.

The syntax is ChrW(&H15F), which correctly displays the s with cedilla in a simple text box that I set up in my test database. Using ChrW(&HBA) displays the degree-like symbol that you mention. You mention them being the other way 'round, which makes me wonder if that isn't a typo?

I'm not personally familiar with Lebans' RTF2, but after doing a little research into the character sets and code pages involved, it looks to me like you actually have a code page problem, not an fcharset problem. For example, the code page for Latin 2 is 1250, and it maps the code page character BA to the Unicode character 015F. However, code page 1252 kindly takes the same code page character BA and maps it to the Unicode character 00BA, which is the masculine ordinal indicator (it looks much like the degree sign, although the degree sign itself is 00B0).

My suggestion is that you are receiving the text encoded with code page 1250 and interpreting it based on the 1252 encoding.
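To make the discrepancy concrete, here is a small Python sketch (Python rather than VB6, purely for illustration) that decodes the byte BA under both code pages and then lists every high byte on which the two pages disagree. The helper name `differing_bytes` is hypothetical; the codec names `cp1250` and `cp1252` are Python's standard names for the two Windows code pages.

```python
# Decode the same byte under the two Windows code pages discussed above.
raw = bytes([0xBA])

as_cp1250 = raw.decode("cp1250")  # Central European (Latin 2)
as_cp1252 = raw.decode("cp1252")  # Western European

print(f"cp1250: {as_cp1250!r} U+{ord(as_cp1250):04X}")
print(f"cp1252: {as_cp1252!r} U+{ord(as_cp1252):04X}")


def differing_bytes(page_a="cp1250", page_b="cp1252"):
    """Return {byte: (char_a, char_b)} for high bytes the two pages decode differently."""
    table = {}
    for value in range(0x80, 0x100):
        single = bytes([value])
        # errors="replace" yields U+FFFD for bytes a code page leaves undefined.
        char_a = single.decode(page_a, errors="replace")
        char_b = single.decode(page_b, errors="replace")
        if char_a != char_b:
            table[value] = (char_a, char_b)
    return table


for value, (char_a, char_b) in differing_bytes().items():
    print(f"{value:02X}: cp1250 U+{ord(char_a):04X}  cp1252 U+{ord(char_b):04X}")
```

The printed list doubles as a character table for the positions where text stored as 1250 would be misread under 1252.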

Again, I'm not familiar with Lebans' RTF2, but somehow you will need to find the coding to change this encoding/decoding discrepancy. Sorry to not be able to give you any specific help on doing that :-(

Regards,
Scott
Mar 5 '08 #4
Scott Price
1,384 Expert 1GB
A few links that contain helpful and not so helpful information that I came across in my research:

MS developer discussion

Character sets and Code pages

Wikipedia Character Encoding

Wikipedia Code Pages

Wikipedia Romanian Alphabet

Kind regards,
Scott
Mar 5 '08 #5
Thanks for your suggestion Scott. I think the 1252 code page will point us in the right direction. Will let you know how we go.
Mar 6 '08 #6
Scott Price
1,384 Expert 1GB
Let me know how it goes! Good luck.

Regards,
Scott
Mar 6 '08 #7
