Need help on string manipulation

WaterWalk

Hello, I'm currently learning string manipulation. I'm curious about
the preferred way to manipulate strings in C, especially when the
strings contain non-ASCII characters. For example, if a substring needs
to be replaced, or a single character needs to be changed, what should
I do? Is it better to convert strings to UTF-32 (UCS-4) before
manipulating them?

But on Windows, wchar_t is 16 bits, which isn't enough for characters
whose code points don't fit in 16 bits.

On Linux, I hear wchar_t is 32 bits. Maybe on Linux strings can simply
be converted to wchar_t and then handled without worrying? I'm not
sure.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.

Mar 27 '06 #1

Richard G. Riley

I was looking up some similar stuff recently. Here is an OK start:

http://www.chemie.fu-berlin.de/chemn...c/libc_18.html

This concentrates more on "big chars", as I call them: characters
encoded in more than a standard char.

At the end of the day, do you really need DBCS or MBCS support?

Good luck. Hope the link helps.

--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matlo...g.html#tth_sEc

Mar 27 '06 #2

liljencrantz

Characters represented by wchar_t must use one wchar_t per character,
unlike characters stored in char, which may use a multibyte encoding.
The actual size and encoding of wchar_t are implementation-defined;
DragonFly BSD, for example, uses different wchar_t encodings depending
on the encoding of char strings. Where wchar_t is 16 bits, as on
Windows, a single wchar_t cannot hold code points above U+FFFF, which
includes some of the newer Unicode characters. If that is a problem for
you, then avoid wchar_t. You will not have this problem under Linux,
since glibc uses UCS-4, a 31-bit code space, stored in a 32-bit
wchar_t.
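
Since the width and encoding of wchar_t vary between platforms, it may be worth checking them before relying on one wchar_t per character. A minimal sketch, assuming a C99 compiler; the helper names (wchar_holds_all_unicode, wchar_is_unicode, report_wchar) are made up for illustration, while WCHAR_MAX and __STDC_ISO_10646__ are standard:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if a single wchar_t can hold every Unicode code point
   (U+0000..U+10FFFF), 0 otherwise (e.g. with a 16-bit wchar_t). */
static int wchar_holds_all_unicode(void)
{
    return (unsigned long long)WCHAR_MAX >= 0x10FFFFULL;
}

/* Returns 1 if the implementation defines __STDC_ISO_10646__, i.e. it
   promises that wchar_t values are ISO 10646 (Unicode) code points. */
static int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print a short report about wchar_t. */
static void report_wchar(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "sizeof(wchar_t)       = %zu bytes\n", sizeof(wchar_t));
    fprintf(out, "WCHAR_MAX             = %llu\n",
            (unsigned long long)WCHAR_MAX);
    fprintf(out, "holds all code points = %s\n",
            wchar_holds_all_unicode() ? "yes" : "no");
    fprintf(out, "__STDC_ISO_10646__    = %s\n",
            wchar_is_unicode() ? "defined" : "not defined");
}
```

Calling report_wchar(stdout) once at startup is enough to see what a given platform provides. If the first check fails, indexing a wchar_t array will not give you one element per code point.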

Things like being able to use [] to access the character at a specific
index, being able to use ints to iterate over a string, and being able
to examine a specific character without worrying about whether it is a
multibyte character make life _much_ easier.
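
To illustrate the indexing point, here is a rough sketch that converts a multibyte string to a wide string and then replaces characters with plain array indexing. The function names to_wide and replace_wchar are hypothetical; mbsrtowcs is the standard conversion function and uses the encoding of the current locale, so a real program would normally call setlocale(LC_ALL, "") first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string (current locale's encoding) to a newly
   allocated wide string. Returns NULL on invalid input or allocation
   failure. The caller frees the result. */
static wchar_t *to_wide(const char *s)
{
    mbstate_t st;
    const char *p = s;
    size_t n;
    wchar_t *w;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);   /* NULL dst: only count characters */
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, n + 1, &st);      /* n + 1 leaves room for L'\0' */
    return w;
}

/* Replace every occurrence of `from` with `to`, in place.
   Returns the number of replacements. */
static size_t replace_wchar(wchar_t *w, wchar_t from, wchar_t to)
{
    size_t count = 0;
    for (size_t i = 0; w[i] != L'\0'; i++) {
        if (w[i] == from) {
            w[i] = to;
            count++;
        }
    }
    return count;
}
```

The loop in replace_wchar never has to ask how many bytes a character occupies, which is the convenience described above. It only holds as long as every character of interest fits in one wchar_t on the platform in question.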

I have written a non-trivial program called fish (a command-line shell
for Unix, kind of like bash or zsh) that uses wide character strings
internally. You can download it from
http://roo.no-ip.org/fish/.

The lessons I've learned from this:

* Converting from char strings to wchar_t is not hard, but while the C
string handling functions have wide character equivalents, most other
functions don't. Mixing narrow and wide character strings in the same
program is a nightmare; you will end up with a maze of spaghetti that
you will never be able to untangle. Don't. The best way around this in
the long term is to use a wrapper library around the functions you
want. I have written a wrapper around some common Unix functions like
open, fopen, stat, access, realpath, etc. You can simply borrow the
wutil.c and wutil.h files from fish for your own project if you wish;
it should be easy to extend them to more functions. (Provided you
aren't writing commercial software: fish is GPL-licensed.)

* If your program faces the user a lot, it is likely that you will be
exposed to data in the wrong character set, e.g. it is not uncommon to
have some filenames in Latin-1 even if your system should use UTF-8. I
handle this in a way that breaks on systems that don't use a Unicode
representation for wchar_t, but that works with 16-bit Unicode
encodings. Specifically, I have a special set of conversion routines
that take bytes that failed to convert correctly and map them to a
byte range inside the Unicode private use area. That way, I have a
reversible representation of invalid characters, which means I can do
e.g. wildcarding on filenames with invalid characters. The biggest
hurdle with this method is that you have to make sure to _always_ use
your 'magical' conversion routines and not the ones supplied by the
system. I hope that eventually all wide character sets will provide a
private use area of some sort, it is a very useful feature.

* Using getopt and gettext for option parsing and i18n also took some
work, but it was doable. I used search-and-replace on the getopt source
to make 'wgetopt', and wrote a wrapper around getopt.

* Memory usage may increase by a factor of 4 or more. 90% of the
allocated memory in fish is used to store strings, so the memory usage
increase is significant. Because there are conversions taking place,
additional memory is required to store both the narrow and wide version
of a string at once.
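
The wrapper idea from the first lesson can be sketched roughly as follows. This is not the code from fish's wutil.c; to_narrow and wide_fopen are hypothetical names, and the sketch assumes the file system expects names in the current locale's multibyte encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a wide string to a newly allocated multibyte string in the
   current locale's encoding. Returns NULL if a character cannot be
   represented or if allocation fails. The caller frees the result. */
static char *to_narrow(const wchar_t *w)
{
    mbstate_t st;
    const wchar_t *p = w;
    size_t n;
    char *s;

    assert(w != NULL);
    memset(&st, 0, sizeof st);
    n = wcsrtombs(NULL, &p, 0, &st);   /* NULL dst: only count bytes */
    if (n == (size_t)-1)
        return NULL;

    s = malloc(n + 1);
    if (s == NULL)
        return NULL;

    p = w;
    memset(&st, 0, sizeof st);
    wcsrtombs(s, &p, n + 1, &st);      /* n + 1 leaves room for '\0' */
    return s;
}

/* Hypothetical wrapper: fopen() taking wide-character arguments, so the
   rest of the program never has to touch narrow strings. */
static FILE *wide_fopen(const wchar_t *path, const wchar_t *mode)
{
    char *npath = to_narrow(path);
    char *nmode = to_narrow(mode);
    FILE *f = NULL;

    if (npath != NULL && nmode != NULL)
        f = fopen(npath, nmode);

    free(npath);   /* free(NULL) is a no-op */
    free(nmode);
    return f;
}
```

A call then looks like wide_fopen(L"notes.txt", L"r"). Note that the narrow copy of the path exists alongside the wide one for the duration of the call, which is the extra memory cost mentioned in the last lesson.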

In the end, I still think it is often worth using wide character
strings, since you avoid the huge hassle of handling what could be
multibyte encodings of strings. You might also consider using a string
handling library that does all the evil string handling for you,
though. Some things will be easier that way - others harder.

Good luck. I have found Unicode in C to be hard, but not impossible.
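
The private use area trick from the second lesson could look something like the sketch below. It is an illustration under stated assumptions, not the routines from fish: it assumes the input is meant to be UTF-8, stores code points in uint32_t rather than wchar_t, and uses a made-up base value PUA_BASE. To keep the mapping reversible, valid input that already falls in the reserved range is escaped byte by byte as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PUA_BASE 0xF600u   /* hypothetical: 256 code points in U+E000..U+F8FF */

/* Decode one UTF-8 sequence at s (n >= 1 bytes available). Returns the
   number of bytes consumed, or 0 if the sequence is malformed. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t c = s[0], min;
    size_t len;

    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
    else return 0;                       /* stray continuation or bad lead */

    if (len > n) return 0;               /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Decode n bytes into out (capacity >= n). Bytes that do not decode are
   stored as PUA_BASE + byte. Returns the number of code points written. */
static size_t decode_lossless(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t i = 0, k = 0;
    assert(out != NULL);
    while (i < n) {
        uint32_t cp = 0;
        size_t used = utf8_decode_one(s + i, n - i, &cp);
        if (used == 0 || (cp >= PUA_BASE && cp <= PUA_BASE + 0xFF)) {
            out[k++] = PUA_BASE + s[i];  /* keep the raw byte, reversibly */
            i += 1;
        } else {
            out[k++] = cp;
            i += used;
        }
    }
    return k;
}

/* Inverse of decode_lossless. out needs capacity >= 4 * k bytes.
   Returns the number of bytes written. */
static size_t encode_lossless(const uint32_t *in, size_t k, unsigned char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < k; i++) {
        uint32_t c = in[i];
        if (c >= PUA_BASE && c <= PUA_BASE + 0xFF) {
            out[j++] = (unsigned char)(c - PUA_BASE);   /* raw byte back */
        } else if (c < 0x80) {
            out[j++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[j++] = (unsigned char)(0xC0 | (c >> 6));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[j++] = (unsigned char)(0xE0 | (c >> 12));
            out[j++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[j++] = (unsigned char)(0xF0 | (c >> 18));
            out[j++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[j++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return j;
}
```

For example, a filename containing a lone Latin-1 byte 0xE9 decodes to the code point PUA_BASE + 0xE9, can be matched against wildcards like any other character, and encodes back to the original byte. As noted above, this only works if every conversion in the program goes through these routines.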

--
Axel

Mar 27 '06 #3

WaterWalk

Yes, this is my problem. If any Unicode character could be encoded in a
single wchar_t, life would be much easier. *BUT* on Windows, I can't
simply use wchar_t, which is only 16 bits, to represent all Unicode
characters. I hear that MS Word uses two wchar_t units to hold those
"extended characters". Then, if one character in a string needs to be
changed, the handy array index operation can't be used. What's more,
the whole string may need to change. This is really annoying. Any
ideas?

For some reason, I can't reach the fish site
(http://roo.no-ip.org/fish/). That's a pity.

Mar 28 '06 #4

Ben Bacarisse

For your information, the most common encoding in which multiple 16-bit
objects are used for some Unicode code points is called UTF-16. If you
want to use glibc's indexable UCS-4 encoding, you can use the GNU C
toolchain on Windows. If not, you may get better answers about this in
an MS Windows programming group.
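
For reference, this is roughly how UTF-16 represents code points above U+FFFF as a pair of 16-bit units (a surrogate pair). The sketch uses uint16_t rather than wchar_t so that it behaves the same on every platform; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit units
   written (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate  */
    return 2;
}

/* Decode the code point starting at s in a 0-terminated UTF-16 string
   and store how many units it occupied. An unpaired surrogate is
   returned as-is with *used == 1. */
static uint32_t utf16_decode(const uint16_t *s, size_t *used)
{
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF &&
        s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *used = 2;
        return 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
                          (uint32_t)(s[1] - 0xDC00));
    }
    *used = 1;
    return s[0];
}

/* Count code points (not units) in a well-formed, 0-terminated
   UTF-16 string by skipping low surrogates. */
static size_t utf16_length(const uint16_t *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            n++;
    }
    return n;
}
```

Because a character may take one or two units, s[i] is the i-th unit rather than the i-th character, and replacing a one-unit character with a two-unit one means shifting the rest of the string. That is the cost that indexable UCS-4 avoids.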

--
Ben.

Mar 29 '06 #5
