support for UTF-8 in C language standard?

Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.
For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes. So even though the number of characters is the same
they couldn't just be swapped byte for byte.

Thanks,

David Mathog

Nov 22 '06 #1
David Mathog wrote:
> Does any standard C function support reading or writing UTF-8?

No.
UTF-8 is pretty simple though, and C code is available everywhere.
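
To give a sense of how little code is involved, here is a minimal sketch of a hand-written decoder. The function name utf8_decode_one is made up for this illustration and is not part of any standard library. It assumes the input is a byte buffer of known length, and it rejects overlong forms, surrogates, and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard function): decode one UTF-8
 * sequence starting at s, with at most n bytes available.
 * On success, store the code point in *out and return the number of
 * bytes consumed (1 to 4). Return 0 on malformed, truncated, overlong,
 * surrogate, or out-of-range input. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and the first payload bits. */
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else
        return 0;               /* stray continuation byte or invalid lead byte */

    if (len > n)
        return 0;               /* sequence is cut off by the end of the buffer */

    /* Each continuation byte has the form 10xxxxxx and adds six bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_for_len[len])
        return 0;               /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* UTF-16 surrogates are not valid in UTF-8 */
    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

Calling this in a loop, advancing by the returned length each time, fills an array of 32-bit integers of the kind the original question describes.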

> For instance, even a simple operation like "swap characters 1002->1005
> with 2007->2010" would be a pain, you'd pretty much have to
> parse from the beginning of the UTF-8 string
> just to find the specified ranges, and then they might be different
> numbers of bytes.

And how would having a standard function change that?
Nov 22 '06 #2

David Mathog wrote:
> Does any standard C function support reading or writing UTF-8?
> I'm not talking about the trivial case where the text is just the
> ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
> function that could read UTF-8 when 2, 3, or even 4 byte encodings are
> present and store the final unencoded character in, I guess, an array
> of 32 bit integers.
>
> I'm guessing that there _might_ be functions for this somewhere
> in the C standard because trying to apply typical text manipulations
> on a UTF-8 string directly seems to be quite messy and slow.
> For instance, even a simple operation like "swap characters 1002->1005
> with 2007->2010" would be a pain, you'd pretty much have to
> parse from the beginning of the UTF-8 string
> just to find the specified ranges, and then they might be different
> numbers of bytes. So even though the number of characters is the same
> they couldn't just be swapped byte for byte.

Yes. Assuming your environment has a locale which supports UTF-8 and
whatever format you want the result in (UCS-4, presumably), then the
multibyte and wide character functions should do what you want - see
mbtowc() and mbstowcs() for starters.
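
A minimal sketch of that approach follows. The wrapper name utf8_to_wide is invented for this example, and the locale name passed in is an assumption: names such as "en_US.UTF-8" or "C.UTF-8" exist on many Unix systems but are not guaranteed anywhere. Whether each wchar_t ends up holding a full UCS-4 code point is also up to the implementation.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a NUL-terminated UTF-8 string to wide
 * characters with the standard mbstowcs().
 * locale_name is system-specific and must name a UTF-8 locale.
 * Returns the number of wide characters stored (excluding the
 * terminator), or -1 if the locale is unavailable, the input is not
 * valid in that locale, or dst (capacity cap) is too small to hold
 * the result plus its terminator. */
long utf8_to_wide(const char *locale_name, const char *src,
                  wchar_t *dst, size_t cap)
{
    long result = -1;
    char *saved = NULL;
    const char *current = setlocale(LC_CTYPE, NULL);

    /* Copy the current locale name; later setlocale() calls may overwrite it. */
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        size_t n = mbstowcs(dst, src, cap);
        /* n < cap guarantees that the terminating L'\0' was written. */
        if (n != (size_t)-1 && n < cap)
            result = (long)n;
    }

    /* Restore the caller's locale if it could be saved. */
    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

A program that works with UTF-8 throughout would more likely call setlocale() once at startup and then use mbstowcs() directly; the save and restore here only keeps the sketch from changing global state.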

Nov 23 '06 #3
"David Mathog" <ma****@caltech.edu> wrote in message
news:ek**********@naig.caltech.edu...
> Does any standard C function support reading or writing UTF-8?
> I'm not talking about the trivial case where the text is just the
> ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
> function that could read UTF-8 when 2, 3, or even 4 byte encodings are
> present and store the final unencoded character in, I guess, an array
> of 32 bit integers.
>
> I'm guessing that there _might_ be functions for this somewhere
> in the C standard because trying to apply typical text manipulations
> on a UTF-8 string directly seems to be quite messy and slow.

The locale support somewhat addresses this; unfortunately, locale names
are not standardized so your program still won't be portable in practice
even if the code is technically portable. However, if you can find the
right locale on your system, it's possible to use C's standard functions
to turn an input stream into an array of wchar_t's, manipulate them as
desired, and output them again as UTF-8.
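
As a rough sketch of the "manipulate and output" half: once the text is in a wchar_t array, the range swap from the original question becomes plain array indexing, and wcstombs() converts the result back. Both function names below are made up for this example. The sketch assumes the caller has already selected a UTF-8 locale with setlocale(), and that one wchar_t holds one character, which is true for a 32-bit wchar_t but not on systems where wchar_t is 16 bits.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: swap two equal-length, non-overlapping ranges
 * of a wide string in place. a and b are the starting indices and
 * len is the number of characters in each range. */
void swap_wide_ranges(wchar_t *s, size_t a, size_t b, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        wchar_t tmp = s[a + i];
        s[a + i] = s[b + i];
        s[b + i] = tmp;
    }
}

/* Hypothetical helper: convert a wide string back to the multibyte
 * encoding of the current LC_CTYPE locale (UTF-8 if the caller has
 * selected a UTF-8 locale). Returns the number of bytes written,
 * excluding the terminator, or -1 if a character cannot be
 * represented or dst (capacity cap) is too small. */
long wide_to_multibyte(const wchar_t *src, char *dst, size_t cap)
{
    size_t n = wcstombs(dst, src, cap);
    if (n == (size_t)-1 || n >= cap)
        return -1;
    return (long)n;
}
```

Because the two ranges have the same number of wide characters, they can be exchanged element by element even though their UTF-8 encodings may differ in byte length.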

<OT>There are a number of third-party libraries that provide a specific
set of conversions including UTF-8, such as libiconv. However, those
libraries are not part of the C Standard itself and thus not portable
either.</OT>

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking


Nov 23 '06 #4
