473,387 Members | 1,705 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Need help on string manipulation

Hello, I'm currently learning string manipulation. I'm curious about
what is the favored way for string manipulation in C, expecially when
strings contain non-ASCII characters. For example, if substrings need
be replaced, or one character needs be changed, what shall I do? Is it
better to convert strings to UCS-32 before manipulation?

But on Windows, wchar_t is 16 bits which isn't enough for characters
which can't be simply encoded using 16 bits.

On Linux, I hear wchar_t is 32 bit. Maybe on Linux, strings can be
simply converted to wchar_t and then handle them without worrying? I'm
not sure.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.

Mar 27 '06 #1
4 3457
On 2006-03-27, WaterWalk <to********@163.com> wrote:
Hello, I'm currently learning string manipulation. I'm curious about
what is the favored way for string manipulation in C, expecially when
strings contain non-ASCII characters. For example, if substrings need
be replaced, or one character needs be changed, what shall I do? Is it
better to convert strings to UCS-32 before manipulation?

But on Windows, wchar_t is 16 bits which isn't enough for characters
which can't be simply encoded using 16 bits.

On Linux, I hear wchar_t is 32 bit. Maybe on Linux, strings can be
simply converted to wchar_t and then handle them without worrying? I'm
not sure.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.


I was looking up some similar stuff recently : here is an ok start -

http://www.chemie.fu-berlin.de/chemn...c/libc_18.html

This concentrates more on "big chars" as I call them : characters
encoded in more that that standard char.

At the end of the day, do you really need dbcs or mbcs support?

good luck. hope the link helps.

--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matlo...g.html#tth_sEc
Mar 27 '06 #2

WaterWalk skrev:
Hello, I'm currently learning string manipulation. I'm curious about
what is the favored way for string manipulation in C, expecially when
strings contain non-ASCII characters. For example, if substrings need
be replaced, or one character needs be changed, what shall I do? Is it
better to convert strings to UCS-32 before manipulation?

But on Windows, wchar_t is 16 bits which isn't enough for characters
which can't be simply encoded using 16 bits.

On Linux, I hear wchar_t is 32 bit. Maybe on Linux, strings can be
simply converted to wchar_t and then handle them without worrying? I'm
not sure.
Characters represented by wchar_t must use one wchar_t per character,
unlike characters using char, which may use a multibyte encoding. The
actual size and encoding of wchar_t is undefined, and e.g. Dragonfly
BSD uses different encodings of wchar_t depending on the encoding of
char strings. If Windows uses a 16-bit wchar_t, you will be unable to
use some newer Unicode characters, if this is a problem for you, then
avoid wchar_t. You will not have this problem under Linux, since glibc
uses the UCS4, which is 31-bit.

Things like being able to use [] to access a character with a specific
index, being able to use int:s to iterate over a string and being able
to examine a specific character without worrying about if it's a
multibyte character makes life _much_ easier.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.


I have written a non-trivial program called fish (It's a commandline
shell for Unix, kind of like bash or zsh) that uses wide character
strings internally, you can download it from
http://roo.no-ip.org/fish/.

The lessons I've learned from this:

* Converting from char strings to wchar_t is not hard, but while C
string handling functions have wide character equivalents, most other
functions don't. Mixing narrow and wide character strings in the same
program is a nightmare, you will end up with a maze of spagetti that
you will never be able to untangle. Don't. The best way to get around
this long-term is to use a wrapper library around the functions you
want. I have written a wrapper around some common Unix functions like
open, fopen, stat, access, realpath, etc.. You can simply borrow the
wutil.c and wutil.h files from fish to use with your own project if you
wish, it should be easy to extend to more functions. (Provided you
aren't writing comercial software - fish is GPL:ed)

* If your program faces the user a lot, it is likely that you will be
exposed to data in the wrong character set, e.g. it is not uncommon to
have some filenames in Latin-1 even if your system should use UTF-8. I
handle this in a way that breaks on systems that don't use a Unicode
representation for wchar_t, but that works with 16-bit Unicode
encodings. Specifically, I have a special set of conversion routines
that takes bytes that failed to convert correctly and map them to a
byte range inside the Unicode private use area. That way, I have a
reversible representation of invalid characters, which means I can do
e.g. wildcarding on filenames with invalid characters. The biggest
hurdle with this method is that you have to make sure to _always_ use
your 'magical' conversion routines and not the ones supplied by the
system. I hope that eventually all wide character sets will provide a
private use area of some sort, it is a very useful feature.

* Using getopt and gettext for option parsing and i18n also took some
work, but it was doable. I used search-and-replace on the getopt source
to make 'wgetopt', and wrote a wrapper arounf getopt.

* Memory usage may increase by a factor of 4 or more. 90% of the
allocated memory in fish is used to store strings, so the memory usage
increase is significant. Because there are conversions taking place,
additional memory is required to store both the narrow and wide version
of a string at once.

In the end, I still think it is often worth using wide character
strings, since you avoid the huge hassle of handling what could be
multibyte encodings of strings. You might also consider using a string
handling library that does all the evil string handling for you,
though. Some things will be easier that way - others harder.

Good luck. I have found Unicode in C to be hard - but not undoable.

--
Axel

Mar 27 '06 #3

li**********@gmail.com 写道:
WaterWalk skrev:
Hello, I'm currently learning string manipulation. I'm curious about
what is the favored way for string manipulation in C, expecially when
strings contain non-ASCII characters. For example, if substrings need
be replaced, or one character needs be changed, what shall I do? Is it
better to convert strings to UCS-32 before manipulation?

But on Windows, wchar_t is 16 bits which isn't enough for characters
which can't be simply encoded using 16 bits.

On Linux, I hear wchar_t is 32 bit. Maybe on Linux, strings can be
simply converted to wchar_t and then handle them without worrying? I'm
not sure.
Characters represented by wchar_t must use one wchar_t per character,
unlike characters using char, which may use a multibyte encoding. The
actual size and encoding of wchar_t is undefined, and e.g. Dragonfly
BSD uses different encodings of wchar_t depending on the encoding of
char strings. If Windows uses a 16-bit wchar_t, you will be unable to
use some newer Unicode characters, if this is a problem for you, then
avoid wchar_t. You will not have this problem under Linux, since glibc
uses the UCS4, which is 31-bit.


Yes, This is my problem. If any unicode char can be encoded in a single
wchar_t, then life will be much easier. *BUT*, on windows, I can't
simply use wchar_t which is only 16-bit to represent all unicode
characters. I hear that MS WORD uses 2 wchar_t chars to hold those
"extented characters". Then, if one char in a string needs be changed,
the handy array index operation can't be used. What's more, the whole
string may need change. This is really annoying. Any ideas?
Things like being able to use [] to access a character with a specific
index, being able to use int:s to iterate over a string and being able
to examine a specific character without worrying about if it's a
multibyte character makes life _much_ easier.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.


I have written a non-trivial program called fish (It's a commandline
shell for Unix, kind of like bash or zsh) that uses wide character
strings internally, you can download it from
http://roo.no-ip.org/fish/.

For some reason, I can't visit this site. Feel sad.

Mar 28 '06 #4
On Mon, 27 Mar 2006 22:29:09 -0800, WaterWalk wrote:
Characters represented by wchar_t must use one wchar_t per character,
unlike characters using char, which may use a multibyte encoding. The
actual size and encoding of wchar_t is undefined, and e.g. Dragonfly
BSD uses different encodings of wchar_t depending on the encoding of
char strings. If Windows uses a 16-bit wchar_t, you will be unable to
use some newer Unicode characters, if this is a problem for you, then
avoid wchar_t. You will not have this problem under Linux, since glibc
uses the UCS4, which is 31-bit.


Yes, This is my problem. If any unicode char can be encoded in a single
wchar_t, then life will be much easier. *BUT*, on windows, I can't
simply use wchar_t which is only 16-bit to represent all unicode
characters. I hear that MS WORD uses 2 wchar_t chars to hold those
"extented characters". Then, if one char in a string needs be changed,
the handy array index operation can't be used. What's more, the whole
string may need change. This is really annoying. Any ideas?


For your information, the most common encoding in which multiple 16-bit
objects are used for some Unicode code points is called UTF16. If you
want to use glibc's indexable UCS4 encoding, you can use the GNU C tool
chain on Windows. If not, you may get better answers about this in an MS
Windows programming group.

--
Ben.
Mar 29 '06 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

10
by: jbailo | last post by:
I've been working on my gcc/Gtk+ application and realized how much I need pointers. String manipulation in c is so beautiful and elegant and fast -- its way above all the dum-dum 'methods' of...
4
by: Dim | last post by:
I found that C# has some buggy ways to process string across methods. I have a class with on global string var and a method where i add / remove from this string Consider it a buffer... with some...
7
by: Tee | last post by:
Hi, I need some help here for DSN connection string. I know it's not recommended to use DSN, even I dont like it as well ... but for now, my situation is I am using a shared hosting. I do not...
23
by: Rogers | last post by:
I want to compare strings of numbers that have a circular boundary condition. This means that the string is arranged in a loop without an end-of-string. The comparaison of two strings now...
10
by: Learner | last post by:
Hello, I am trying to create few dynamic controls and once they are rendered I need to save the information that was entered into these dynamic fileds. For instance when I create 3 radio button...
3
by: crprajan | last post by:
String Manipulation: Given a string like This is a string, I want to remove all single characters( alphabets and numerals) like (a, b, 1, 2, .. ) . So the output of the string will be This is...
7
Frinavale
by: Frinavale | last post by:
I currently have a .NET application that has an object which passes a string (a connection string) as a parameter to another object that does database manipulation. This string isn't stored...
22
by: mann_mathann | last post by:
can anyone tell me a solution: i cannot use the features in standard c++ string classgh i included the string.h file but still its not working.
0
by: L'eau Prosper Research | last post by:
NEW TradeStation 8 Add-on - L'eau Prosper Market Manipulation Profiling Tools Set By L'eau Prosper Research Press Release: L'eau Prosper Research (Website: http://www.leauprosper.com) releases...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.