OpenSP API, Unicode character byte offsets

Hello,

I'm posting here with a somewhat technical question in the hope
of finding someone with experience coding C++ against the SP_API
in OpenSP 1.5.

I have an app that uses the SP_API to parse XML and record file
offsets for elements and attribute values. It works fine with
ISO-8859-1 encoded data. With UTF-8 encoded XML, however, I see
two problems. First, it corrupts element and attribute names
containing characters that encode as UTF-8 multi-byte sequences.
Second, it only gives me character (as opposed to byte) offsets,
which are useless to me when I need to do low-level I/O on the
data.

My XML begins with:

<?xml version="1.0" encoding="utf-8"?>

and I'm setting these envvars in my main program:

(void)putenv("SP_CHARSET_FIXED=YES");
(void)putenv("SP_ENCODING=XML");

The parser gets the *character* count right but the *byte* count
(or length) wrong. For example, it tells me that the element name
composed of the 3 Greek characters U+03D5 U+03AC U+03C9 has
length=3, but the name occupies 6 bytes in my XML file, i.e.
2 bytes per character. I can't find anything in the API that
returns the *byte* offset. What have I missed?
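
To make the arithmetic above concrete, here is a minimal standalone
sketch that computes the UTF-8 byte length of a run of code points.
It does not use the OpenSP API at all. The function names
utf8_char_length and utf8_byte_length are hypothetical, and it
assumes each array element holds one complete Unicode code point,
which may or may not match how Char is defined in a given OpenSP
build.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of OpenSP): number of bytes one
// Unicode code point occupies when encoded as UTF-8.
// Returns 0 for surrogates and for values above U+10FFFF.
std::size_t utf8_char_length(std::uint32_t cp)
{
    if (cp < 0x80)    return 1;            // ASCII
    if (cp < 0x800)   return 2;            // e.g. Greek, Cyrillic
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)  // surrogates are not encodable
            return 0;
        return 3;
    }
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

// Hypothetical helper: total UTF-8 byte length of n code points.
// Assumption: p[i] is a full code point, not a truncated byte.
std::size_t utf8_byte_length(const std::uint32_t *p, std::size_t n)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += utf8_char_length(p[i]);
    return total;
}
```

For the name U+03D5 U+03AC U+03C9, utf8_byte_length returns 6. In
principle a byte offset could be derived the same way, by summing
the byte lengths of all characters preceding a character offset,
provided the character offsets count code points from the start of
the same file and no characters were altered during decoding.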

Further, when I ask for an attribute's name in order to process
it, I only get back as many bytes as there are characters in the
name. That is wrong for a multi-byte encoding like UTF-8, and the
bytes I do get back are not valid UTF-8.
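
For reference, the byte sequence expected for such a name can be
produced with a small standalone encoder like the sketch below. It
is independent of OpenSP. The names append_utf8 and to_utf8 are
hypothetical, and the sketch again assumes that each input element
is one full Unicode code point.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of OpenSP): append the UTF-8
// encoding of one code point to out. Surrogates and values above
// U+10FFFF are replaced by U+FFFD (REPLACEMENT CHARACTER).
void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode n code points as a UTF-8 byte string.
std::string to_utf8(const std::uint32_t *p, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i)
        append_utf8(out, p[i]);
    return out;
}
```

Encoding U+03D5 U+03AC U+03C9 this way yields the six bytes
CF 95 CE AC CF 89, which is what the name looks like in the file.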

Here, simplified, is my attribute event handler:

void XRegionEventHandler::attributes (
    const AttributeList &attributes,
    const StorageObjectSpec *el_storageObj,
    unsigned long &epos )
{
    char name[MAX_NAME];
    const StorageObjectSpec *attr_storageObj;
    size_t nAttributes = attributes.size();
    unsigned long spos;

    for (size_t i = 0; i < nAttributes; i++)
    {
        const Text *text;
        const StringC *string;
        const AttributeValue *value = attributes.value(i);

        if (value)
        {
            switch (value->info(text, string))
            {
            case AttributeValue::cdata:
                {
                    TextIter iter(*text);
                    TextItem::Type type;
                    const Char *p;
                    size_t length;
                    const Location *loc;

                    while (iter.next(type, p, length, loc))
                    {
                        epos = spos + length - CHAR_SIZE;

                        name << attributes.name(i);

                        // process "name" here ...
                    }
                }
                break;

            default:
                break;
            }
        }
    }
}

I've walked through the overloaded << operator in the line:

name << attributes.name(i);

and by the time attributes.name(i) returns, the name is already
corrupt.

Has anyone successfully parsed UTF-8 encoded multi-byte XML and
retrieved byte offsets and the UTF-8 encoded form of the element
and attribute names?

Any help much appreciated and thanks,

Phil
----
Phillip Farber, Programmer
Digital Library Production Service
Hatcher Graduate Library, University of Michigan

Jul 20 '05 #1