OpenSP API, Unicode character byte offsets

Phillip Farber

Hello,

I'm posting here with a somewhat technical question in the hope
of finding someone with experience coding C++ against the SP_API
in OpenSP 1.5.

I have an app that uses the SP_API to parse XML and record file
offsets for elements and attribute values. It works fine with
ISO-8859-1 encoded data. With UTF-8 encoded XML data, however, it
has two problems. First, it corrupts element and attribute names
composed of characters that encode as UTF-8 multi-byte sequences.
Second, it only gives me character (as opposed to byte) offsets,
which are useless to me when I need to do low-level I/O on the
data.

My XML begins with:

<?xml version="1.0" encoding="utf-8"?>

and I'm setting these envvars in my main program:

(void)putenv("SP_CHARSET_FIXED=YES");
(void)putenv("SP_ENCODING=XML");
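
A side note on the two calls above: putenv() takes a non-const char * and keeps the pointer it is given, so on POSIX systems setenv(), which copies its arguments, is a common alternative. The following is only a minimal sketch under that POSIX assumption; the function names are hypothetical and not part of OpenSP, and it presumes the variables must be set before the parser is created.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: set one variable with POSIX setenv() and read it back.
// setenv() copies name and value, so string literals are safe to pass.
bool set_and_verify(const char *name, const char *value)
{
    assert(name != nullptr && value != nullptr);
    if (::setenv(name, value, 1) != 0)   // 1 = overwrite an existing value
        return false;
    const char *seen = std::getenv(name);
    return seen != nullptr && std::strcmp(seen, value) == 0;
}

// Hypothetical helper: set the two variables named in the post.
bool configure_sp_environment()
{
    return set_and_verify("SP_CHARSET_FIXED", "YES")
        && set_and_verify("SP_ENCODING", "XML");
}
```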

The parser gets the *character* count right but the *byte* count
(or length) wrong. E.g. it tells me that the element name
composed of the 3 Greek characters U+03D5 U+03AC U+03C9 has
length=3, but the number of bytes that name occupies in my XML
file is 6, i.e. 2 bytes per character. I can't find anything in
the API that will return the *byte* offset. What have I
missed?
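
For illustration, the 3-versus-6 discrepancy can be reproduced without the parser. The sketch below assumes the characters are available as Unicode code points (plain 32-bit integers standing in for the parser's character type) and that the file is UTF-8. Both function names are hypothetical helpers, not part of the SP_API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes one Unicode code point occupies when encoded as UTF-8.
// Surrogate code points are not rejected here; this is only a sketch.
std::size_t utf8_length(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);       // outside the Unicode range otherwise
    if (cp < 0x80)    return 1;   // ASCII
    if (cp < 0x800)   return 2;   // includes the Greek block
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

// Total UTF-8 byte length of a run of n code points.
std::size_t utf8_length(const std::uint32_t *chars, std::size_t n)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < n; ++i)
        bytes += utf8_length(chars[i]);
    return bytes;
}
```

Summing these lengths only yields a file offset if the text in question appears in the file exactly as parsed (for example, no expanded entity or character references), which is an assumption here.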

Further, when I ask for, e.g., an attribute's name in order to
process it, I get back only as many bytes as there are characters
in the name, which is wrong for a multi-byte encoding like UTF-8,
and the bytes I do get back are not valid UTF-8.

Here, simplified, is my attribute event handler:

void XRegionEventHandler::attributes(
    const AttributeList &attributes,
    const StorageObjectSpec *el_storageObj,
    unsigned long &epos)
{
    char name[MAX_NAME];
    const StorageObjectSpec *attr_storageObj;
    size_t nAttributes = attributes.size();
    unsigned long spos;  // start offset (its computation is not shown in this excerpt)

    for (size_t i = 0; i < nAttributes; i++)
    {
        const Text *text;
        const StringC *string;
        const AttributeValue *value = attributes.value(i);

        if (value)
        {
            switch (value->info(text, string))
            {
            case AttributeValue::cdata:
                {
                    TextIter iter(*text);
                    TextItem::Type type;
                    const Char *p;
                    size_t length;
                    const Location *loc;

                    while (iter.next(type, p, length, loc))
                    {
                        // length is a character count, not a byte count
                        epos = spos + length - CHAR_SIZE;

                        name << attributes.name(i);

                        // process "name" here ...
                    }
                }
                break;

            default:
                break;
            }
        }
    }
}

I've walked through the overloaded << operator in the line:

name << attributes.name(i);

and by the time attributes.name(i) returns, the name is already
corrupt.

Has anyone successfully parsed UTF-8 encoded multi-byte XML and
retrieved byte offsets and the UTF-8 encoded form of the element
and attribute names?

Any help much appreciated and thanks,

Phil
----
Phillip Farber, Programmer
Digital Library Production Service
Hatcher Graduate Library, University of Michigan

Jul 20 '05 #1
