Need help on string manipulation

Hello, I'm currently learning string manipulation. I'm curious about
what the favored way of manipulating strings in C is, especially when
strings contain non-ASCII characters. For example, if substrings need
to be replaced, or one character needs to be changed, what should I do?
Is it better to convert strings to UTF-32 (UCS-4) before manipulation?

But on Windows, wchar_t is 16 bits, which isn't enough for characters
that can't be encoded in a single 16-bit unit.

On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
simply be converted to wchar_t and then handled without worrying? I'm
not sure.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.

Mar 27 '06 #1
On 2006-03-27, WaterWalk <to********@163 .com> wrote:
> Hello, I'm currently learning string manipulation. I'm curious about
> what the favored way of manipulating strings in C is, especially when
> strings contain non-ASCII characters. For example, if substrings need
> to be replaced, or one character needs to be changed, what should I do?
> Is it better to convert strings to UTF-32 (UCS-4) before manipulation?

> But on Windows, wchar_t is 16 bits, which isn't enough for characters
> that can't be encoded in a single 16-bit unit.

> On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
> simply be converted to wchar_t and then handled without worrying? I'm
> not sure.

> What is a "good" way to handle all this mess? Are there any good
> examples? I'll be very thankful for your help.


I was looking up some similar stuff recently. Here is an OK start:

http://www.chemie.fu-berlin.de/chemn...c/libc_18.html

This concentrates more on "big chars", as I call them: characters
encoded in more than the standard char.

At the end of the day, do you really need DBCS or MBCS support?

Good luck. Hope the link helps.

--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matlo...g.html#tth_sEc
Mar 27 '06 #2

WaterWalk wrote:
> Hello, I'm currently learning string manipulation. I'm curious about
> what the favored way of manipulating strings in C is, especially when
> strings contain non-ASCII characters. For example, if substrings need
> to be replaced, or one character needs to be changed, what should I do?
> Is it better to convert strings to UTF-32 (UCS-4) before manipulation?

> But on Windows, wchar_t is 16 bits, which isn't enough for characters
> that can't be encoded in a single 16-bit unit.

> On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
> simply be converted to wchar_t and then handled without worrying? I'm
> not sure.

Characters represented by wchar_t must use one wchar_t per character,
unlike characters using char, which may use a multibyte encoding. The
actual size and encoding of wchar_t are implementation-defined; e.g.
DragonFly BSD uses different encodings of wchar_t depending on the
encoding of char strings. If Windows uses a 16-bit wchar_t, you will be
unable to use some newer Unicode characters. If this is a problem for
you, then avoid wchar_t. You will not have this problem under Linux,
since glibc uses UCS-4, which is 31-bit.

Things like being able to use [] to access a character at a specific
index, being able to use ints to iterate over a string, and being able
to examine a specific character without worrying about whether it's a
multibyte character make life _much_ easier.
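
To make the indexing point concrete, here is a minimal sketch, assuming the input is UTF-8 and using fixed 32-bit units rather than wchar_t so that it behaves the same on Windows and Linux. The helper name utf8_to_utf32 is made up for this example and does not come from fish or any library, and for brevity it does not reject overlong forms or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into 32-bit code points.
 * Returns the number of code points written (not counting the
 * terminating 0), or (size_t)-1 on malformed input or if out is too
 * small. Hypothetical helper; validation is deliberately incomplete. */
size_t utf8_to_utf32(const char *in, uint32_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra;  /* number of continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or bad lead byte */
        s++;

        while (extra-- > 0) {
            /* a NUL here also fails the check, so we never run past the end */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3F);
            s++;
        }
        if (n + 1 >= out_cap) return (size_t)-1;  /* keep room for the 0 */
        out[n++] = cp;
    }
    if (out_cap == 0) return (size_t)-1;
    out[n] = 0;
    return n;
}
```

Once the string is decoded, changing one character is a plain assignment such as buf[2] = 'i', no matter how many bytes that character occupied in the UTF-8 input. The price is the extra memory and the conversion back before the string is handed to anything that expects char.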

> What is a "good" way to handle all this mess? Are there any good
> examples? I'll be very thankful for your help.


I have written a non-trivial program called fish (a command-line
shell for Unix, kind of like bash or zsh) that uses wide character
strings internally. You can download it from
http://roo.no-ip.org/fish/.

The lessons I've learned from this:

* Converting from char strings to wchar_t is not hard, but while the C
string handling functions have wide character equivalents, most other
functions don't. Mixing narrow and wide character strings in the same
program is a nightmare; you will end up with a maze of spaghetti that
you will never be able to untangle. Don't. The best way to get around
this long-term is to use a wrapper library around the functions you
want. I have written a wrapper around some common Unix functions like
open, fopen, stat, access, realpath, etc. You can simply borrow the
wutil.c and wutil.h files from fish to use in your own project if you
wish; it should be easy to extend them to more functions. (Provided you
aren't writing commercial software: fish is GPL-licensed.)

* If your program faces the user a lot, it is likely that you will be
exposed to data in the wrong character set, e.g. it is not uncommon to
have some filenames in Latin-1 even if your system should use UTF-8. I
handle this in a way that breaks on systems that don't use a Unicode
representation for wchar_t, but that works with 16-bit Unicode
encodings. Specifically, I have a special set of conversion routines
that take bytes that failed to convert correctly and map them to a
range inside the Unicode private use area. That way, I have a
reversible representation of invalid characters, which means I can do
e.g. wildcarding on filenames with invalid characters. The biggest
hurdle with this method is that you have to make sure to _always_ use
your 'magical' conversion routines and not the ones supplied by the
system. I hope that eventually all wide character sets will provide a
private use area of some sort; it is a very useful feature.

* Using getopt and gettext for option parsing and i18n also took some
work, but it was doable. I used search-and-replace on the getopt source
to make 'wgetopt', and wrote a wrapper around getopt.

* Memory usage may increase by a factor of 4 or more. 90% of the
allocated memory in fish is used to store strings, so the memory usage
increase is significant. Because there are conversions taking place,
additional memory is required to store both the narrow and wide version
of a string at once.
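
The private use area trick from the list above can be sketched roughly as follows. This is not the code from fish: the function names, the base value 0xF600, and the use of 32-bit units instead of wchar_t are all assumptions made for illustration. Any byte that does not decode as UTF-8 is stored as RAW_BASE plus the byte value, and a valid encoding of a code point inside that reserved range is also treated as raw bytes, so that the mapping stays reversible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base inside the BMP private use area (U+E000..U+F8FF). */
#define RAW_BASE 0xF600u

/* Try to decode one UTF-8 sequence at s (len >= 1 bytes available).
 * Returns the number of bytes consumed, or 0 if it is not acceptable. */
static size_t decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need;
    uint32_t c, min;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; need = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; need = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; need = 4; min = 0x10000; }
    else return 0;

    if (len < need) return 0;
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    /* Reject the reserved range too, so the round trip stays exact. */
    if (c >= RAW_BASE && c <= RAW_BASE + 0xFF) return 0;
    *cp = c;
    return need;
}

/* Bytes to code points. out needs room for len elements. Returns count. */
size_t bytes_to_wide(const unsigned char *in, size_t len, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        uint32_t cp;
        size_t used = decode_one(in + i, len - i, &cp);
        if (used == 0) {             /* undecodable: keep the byte, tagged */
            cp = RAW_BASE + in[i];
            used = 1;
        }
        out[n++] = cp;
        i += used;
    }
    return n;
}

/* Code points to bytes. out needs room for 4 * n bytes. Returns byte count. */
size_t wide_to_bytes(const uint32_t *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        if (c >= RAW_BASE && c <= RAW_BASE + 0xFF) {
            out[o++] = (unsigned char)(c - RAW_BASE);  /* restore raw byte */
        } else if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

With this sketch, a Latin-1 filename such as the bytes "caf\xE9.txt" becomes eight code points, the fourth being 0xF6E9, and converting back yields the original eight bytes. As noted above, the scheme only holds if every conversion in the program goes through these routines.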

In the end, I still think it is often worth using wide character
strings, since you avoid the huge hassle of handling what could be
multibyte encodings of strings. You might also consider using a string
handling library that does all the evil string handling for you,
though. Some things will be easier that way, others harder.

Good luck. I have found Unicode in C to be hard, but not impossible.

--
Axel

Mar 27 '06 #3

li**********@gm ail.com wrote:
> WaterWalk wrote:
>> Hello, I'm currently learning string manipulation. I'm curious about
>> what the favored way of manipulating strings in C is, especially when
>> strings contain non-ASCII characters. For example, if substrings need
>> to be replaced, or one character needs to be changed, what should I do?
>> Is it better to convert strings to UTF-32 (UCS-4) before manipulation?

>> But on Windows, wchar_t is 16 bits, which isn't enough for characters
>> that can't be encoded in a single 16-bit unit.

>> On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
>> simply be converted to wchar_t and then handled without worrying? I'm
>> not sure.

> Characters represented by wchar_t must use one wchar_t per character,
> unlike characters using char, which may use a multibyte encoding. The
> actual size and encoding of wchar_t are implementation-defined; e.g.
> DragonFly BSD uses different encodings of wchar_t depending on the
> encoding of char strings. If Windows uses a 16-bit wchar_t, you will be
> unable to use some newer Unicode characters. If this is a problem for
> you, then avoid wchar_t. You will not have this problem under Linux,
> since glibc uses UCS-4, which is 31-bit.


Yes, this is my problem. If any Unicode character could be encoded in a
single wchar_t, life would be much easier. *BUT* on Windows, I can't
simply use wchar_t, which is only 16 bits, to represent all Unicode
characters. I hear that MS Word uses two wchar_t units to hold those
"extended characters". Then, if one character in a string needs to be
changed, the handy array index operation can't be used. What's more, the
whole string may need to change. This is really annoying. Any ideas?

> Things like being able to use [] to access a character at a specific
> index, being able to use ints to iterate over a string, and being able
> to examine a specific character without worrying about whether it's a
> multibyte character make life _much_ easier.

>> What is a "good" way to handle all this mess? Are there any good
>> examples? I'll be very thankful for your help.

> I have written a non-trivial program called fish (a command-line
> shell for Unix, kind of like bash or zsh) that uses wide character
> strings internally. You can download it from
> http://roo.no-ip.org/fish/.

For some reason, I can't visit this site. Feel sad.

Mar 28 '06 #4
On Mon, 27 Mar 2006 22:29:09 -0800, WaterWalk wrote:
>> Characters represented by wchar_t must use one wchar_t per character,
>> unlike characters using char, which may use a multibyte encoding. The
>> actual size and encoding of wchar_t are implementation-defined; e.g.
>> DragonFly BSD uses different encodings of wchar_t depending on the
>> encoding of char strings. If Windows uses a 16-bit wchar_t, you will be
>> unable to use some newer Unicode characters. If this is a problem for
>> you, then avoid wchar_t. You will not have this problem under Linux,
>> since glibc uses UCS-4, which is 31-bit.

> Yes, this is my problem. If any Unicode character could be encoded in a
> single wchar_t, life would be much easier. *BUT* on Windows, I can't
> simply use wchar_t, which is only 16 bits, to represent all Unicode
> characters. I hear that MS Word uses two wchar_t units to hold those
> "extended characters". Then, if one character in a string needs to be
> changed, the handy array index operation can't be used. What's more, the
> whole string may need to change. This is really annoying. Any ideas?


For your information, the most common encoding in which multiple 16-bit
units are used for some Unicode code points is called UTF-16. If you
want to use glibc's indexable UCS-4 encoding, you can use the GNU C
toolchain on Windows. If not, you may get better answers about this in
an MS Windows programming group.
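
As a rough illustration of how UTF-16 uses two 16-bit units for code points above U+FFFF, and why indexing by character then needs a scan, here is a minimal sketch. The function names utf16_encode and utf16_at are invented for this example; uint16_t stands in for a 16-bit wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Writes 1 or 2 units to out and
 * returns how many were written, or 0 if cp is a surrogate value or
 * lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}

/* Return the code point at position index, counted in code points
 * rather than 16-bit units, in a buffer of n units. Returns 0xFFFFFFFF
 * if index is out of range. Unpaired surrogates are returned as-is.
 * The scan is linear because units per character vary. */
uint32_t utf16_at(const uint16_t *s, size_t n, size_t index)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < n) {
        uint32_t cp = s[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            /* combine the surrogate pair into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(s[i] - 0xDC00);
            i++;
        }
        if (index-- == 0) return cp;
    }
    return 0xFFFFFFFFu;
}
```

For example, U+1D11E (a musical G clef) encodes as the pair 0xD834 0xDD1E. In the buffer { 'a', 0xD834, 0xDD1E, 'b' }, the character 'b' is the third code point but sits at unit index 3, which is exactly the mismatch that makes plain array indexing unreliable with a 16-bit wchar_t.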

--
Ben.
Mar 29 '06 #5
