strstr for Unicode characters

Hi all,

How could one write an strstr function to work with unicode characters?

Are there existing implementations/solutions/api for doing so?

Any pointers would be appreciated.

Thanks ..

Sep 4 '06 #1

"Kelvin Moss" <km**********@yahoo.com> wrote in message
news:11**********************@p79g2000cwp.googlegroups.com...
> Hi all,
>
> How could one write an strstr function to work with unicode characters?
>
> Are there existing implementations/solutions/api for doing so?
>
> Any pointers would be appreciated.
>
> Thanks ..

For Windows, there is wcsstr. See:
http://msdn.microsoft.com/library/de...c_._mbsstr.asp
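
A minimal sketch of the wcsstr suggestion. wcsstr is declared in <wchar.h> and is part of standard C, so it is not limited to the Windows runtime; the width of wchar_t is implementation-defined (16 bits on Windows, commonly 32 bits elsewhere). The helper name wide_find is made up for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: index (in wchar_t units) of the first
 * occurrence of needle in haystack, or -1 if it does not occur. */
long wide_find(const wchar_t *haystack, const wchar_t *needle)
{
    const wchar_t *hit = wcsstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

For example, wide_find(L"caf\u00e9 au lait", L"au") reports the position just after the space, because each character in that string occupies one wchar_t. This only helps if the text is already held as wide characters; UTF-8 input would have to be converted first.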

--
Papastefanos Serafeim
Sep 4 '06 #2
Kelvin Moss wrote:
> Hi all,
>
> How could one write an strstr function to work with unicode characters?
>
> Are there existing implementations/solutions/api for doing so?
>
> Any pointers would be appreciated.
>
> Thanks ..

You could always use memcpy() or memmove().
Sep 5 '06 #3
# Hi all,
#
# How could one write an strstr function to work with unicode characters?
#
# Are there existing implementations/solutions/api for doing so?

String functions should work just fine on UTF-8 encoded Unicode
characters - minding that non-ASCII characters will have codes greater
than 127 (or less than zero if char is signed) and might be represented
by multiple bytes. For something like strstr, which only looks for byte
sequences without embedded zeros, it should be fine, while strchr
can be problematic. There are also wide character (wc...) types
and functions becoming available, which will probably use 16-bit or
wider Unicode characters.
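
A small sketch of the point about strstr, assuming both strings are valid, NUL-terminated UTF-8. strstr returns a position in bytes, so a second helper converts that to a character (code point) count. Both function names are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Byte offset of the first occurrence of needle in haystack, or -1.
 * Both arguments are assumed to be valid, NUL-terminated UTF-8. */
long utf8_find_bytes(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}

/* Convert a byte offset into a count of code points by skipping
 * continuation bytes, which all have the bit pattern 10xxxxxx. */
long utf8_offset_to_index(const char *s, long byte_offset)
{
    long index = 0;
    for (long i = 0; i < byte_offset; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            index++;
    }
    return index;
}
```

The match itself is exact at the byte level. The only surprise is that the byte offset and the character index differ as soon as a multi-byte character appears before the match.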

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Don't say anything. Especially you.
Sep 6 '06 #4
SM Ryan <wy*****@tango-sierra-oscar-foxtrot-tango.fake.org> wrote:
> There is also wide character (wc...) type
> and functions becoming available which will probably be 16 bit or
> wider unicode characters.

For example, UTF-16 as used by the Mac OS X file system?
--
une bévue
Sep 6 '06 #5

SM Ryan wrote:
> # Hi all,
> #
> # How could one write an strstr function to work with unicode characters?
> #
> # Are there existing implementations/solutions/api for doing so?
>
> String functions should work just fine on UTF-8 encoded Unicode
> characters - minding that non-ASCII characters will have codes greater
> than 127 (or less than zero) and might be represented by multiple bytes.
> For something like strstr which should only be looking for byte
> sequences without embedded zeros, it should be fine, while strchr
> can be problematic.

Yes, I am dealing with UTF-8 encoded Unicode characters. So you mean to
say that as long as I don't have embedded zeroes in the strings, strstr
should be fine, right? I think this assumption may not hold up well
in real applications. Your thoughts?

Thanks ..

Sep 6 '06 #6
"Kelvin Moss" <km**********@yahoo.com> wrote in message
news:11**********************@p79g2000cwp.googlegroups.com...
> SM Ryan wrote:
>> String functions should work just fine on UTF-8 encoded Unicode
>> characters - minding that non-ASCII characters will have codes greater
>> than 127 (or less than zero) and might be represented by multiple
>> bytes.
>> For something like strstr which should only be looking for byte
>> sequences without embedded zeros, it should be fine, while strchr
>> can be problematic.
>
> Yes, I am dealing with UTF8 encoded Unicode characters. So you mean to
> say that as long as I don't have embedded zeroes in the strings strstr
> should be fine. Right? I think this assumption may not work quite
> well in real applications. Your thoughts?

UTF-8 won't have any embedded zeroes by definition; the encoding was
specifically designed to work transparently with C code that assumed
ASCII or some 8-bit ASCII-based encoding.
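
To make the "no embedded zeroes" property concrete, here is a sketch of a UTF-8 encoder plus a brute-force check over every code point other than U+0000. The function names are made up for this example; the bit layout follows the standard UTF-8 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out. Returns the number of
 * bytes written, or 0 for values that are not Unicode scalar values
 * (surrogates, or anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if no code point from U+0001 to U+10FFFF produces a
 * zero byte when encoded, 0 otherwise. */
int utf8_never_emits_zero_byte(void)
{
    unsigned char buf[4];
    for (uint32_t cp = 1; cp <= 0x10FFFF; cp++) {
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == 0)
                return 0;
        }
    }
    return 1;
}
```

Every byte of a multi-byte sequence has its high bit set, so the only zero byte that can appear in UTF-8 text is the encoding of U+0000 itself, which C already treats as the terminator.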

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking


Sep 6 '06 #7
Ioannis Papadopoulos wrote:
> Kelvin Moss wrote:
>> Hi all,
>>
>> How could one write an strstr function to work with unicode characters?
>>
>> Are there existing implementations/solutions/api for doing so?
>>
>> Any pointers would be appreciated.
>>
>> Thanks ..
>
> You could always use memcpy() or memmove().

I do not know what I was thinking at the time. I thought you wanted a
function for strcpy. Please ignore my previous reply.

I tried strstr() with Unicode characters and it seems to work.
Sep 6 '06 #8
Kelvin Moss wrote:
> Hi all,
>
> How could one write an strstr function to work with unicode characters?
>
> Are there existing implementations/solutions/api for doing so?

Do you want to deal with issues such as normalization? E.g. combining
characters can be represented in (many) different ways. In that case,
I've previously worked on a project that used the (IBM) ICU libraries
(licensed under the X license, GPL compatible).
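
A short illustration of why normalization matters for searching. It does not use ICU; it only shows that a byte-wise strstr treats the precomposed and decomposed spellings of the same character as different strings. The constant and function names are invented for this sketch.

```c
#include <string.h>

/* "é" as one precomposed code point, U+00E9 (as in NFC). */
static const char E_ACUTE_NFC[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (as in NFD). */
static const char E_ACUTE_NFD[] = "e\xCC\x81";

/* Plain byte-wise search: 1 if needle occurs in haystack, else 0. */
int bytes_contain(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Searching the decomposed text "cafe\xCC\x81" for E_ACUTE_NFC finds nothing, although a reader sees the same word, while searching it for plain "cafe" does match. Normalizing both strings to the same form before searching, for instance with a library such as ICU, is the usual way around this.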

Stijn

Sep 6 '06 #9
pe*******@laponie.com.invalid (Une bévue) wrote:
# SM Ryan <wy*****@tango-sierra-oscar-foxtrot-tango.fake.org> wrote:
#
# > There is also wide character (wc...) type
# > and functions becoming available which will probably be 16 bit or
# > wider unicode characters.
#
# for example as UTF16 used on Mac OS X File System ???

Mac OS X file paths are a UTF-8 encoding of Unicode (16 bit, I think).
The file name length limit is counted in UTF-8 bytes.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
Leave it to the Catholics to destroy existence.
Sep 6 '06 #11
"Kelvin Moss" <km**********@yahoo.com> wrote:
#
# SM Ryan wrote:
# # Hi all,
# #
# # How could one write an strstr function to work with unicode characters?
# #
# # Are there existing implementations/solutions/api for doing so?
# >
# > String functions should work just fine on UTF-8 encoded Unicode
# > characters - minding that non-ASCII characters will have codes greater
# > than 127 (or less than zero) and might be represented by multiple bytes.
# > For something like strstr which should only be looking for byte
# > sequences without embedded zeros, it should be fine, while strchr
# > can be problematic.
#
# Yes, I am dealing with UTF8 encoded Unicode characters. So you mean to
# say that as long as I don't have embedded zeroes in the strings strstr
# should be fine. Right? I think this assumption may not work quite well
# in real applications. Your thoughts?

UTF-8 bytes are ASCII characters plus nonzero bytes; UTF-8 encoding
does not insert zero bytes where none existed before. As long as all
you're doing is shuffling bytes around, you can use most str* functions.
Functions like strchr which expect one char to be one character
will only work on the ASCII subset.
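
A sketch of the strchr limitation, assuming valid NUL-terminated UTF-8 input. strchr compares single bytes, so it can find an ASCII character but not a character that takes two or more bytes; passing the character as its own UTF-8 byte string to strstr is one way around that. The helper names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Byte offset of an ASCII character in s, or -1. strchr is fine
 * here because an ASCII character is exactly one byte in UTF-8. */
long ascii_char_offset(const char *s, char c)
{
    const char *hit = strchr(s, c);
    return hit ? (long)(hit - s) : -1L;
}

/* Byte offset of any character in s, where the character is passed
 * as its UTF-8 byte sequence (one to four bytes plus the NUL). */
long utf8_char_offset(const char *s, const char *ch_utf8)
{
    const char *hit = strstr(s, ch_utf8);
    return hit ? (long)(hit - s) : -1L;
}
```

With s = "caf\xC3\xA9", strchr(s, 0xE9) returns NULL because the code point value U+00E9 never appears as a single byte, whereas utf8_char_offset(s, "\xC3\xA9") locates it. Note also that strlen(s) is 5 for this four-character string, since strlen counts bytes.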

In FILEs, you have to negotiate with other programs how they will
interpret byte sequences. If all the applications assume UTF-8
encodings in FILEs, and they handle UTF-8 internally, then everything
will be fine.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
I love the smell of commerce in the morning.
Sep 6 '06 #12
SM Ryan <wy*****@tango-sierra-oscar-foxtrot-tango.fake.org> wrote:
> MacOSX file paths are UTF-8 encoding of Unicode (16 bit I think).
> The file name length limit is the number of UTF-8 bytes.

You're right; here is an extract from TN 2078, "Migrating from FSSpecs to
FSRefs":

struct FSRef {
    UInt8 hidden[80]; /* private to File Manager */
};

However, in the section "FSRefs and long Unicode file names" they write:

OSErr FSRefGetName( const FSRef *fsRef, HFSUniStr255 *name )
{
    return( FSGetCatalogInfo(fsRef, kFSCatInfoNone, NULL, name, NULL,
                             NULL) );
}

An HFSUniStr255 is defined as:

struct HFSUniStr255 {
    UInt16 length;        /* number of unicode characters */
    UniChar unicode[255]; /* unicode characters */
};

How file names are encoded

HFS+ disks store file names as UTF-16 in an Apple-modified form of
-------------------------------^^^^^^^
Normalization Form D (decomposed). This form excludes certain
compatibility decompositions and parts of the symbol blocks, in order to
assure round-trip of file names to Mac OS encodings (applications using
the HFS APIs assume they get the same bytes out that they put in).
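
For names held as counted UTF-16, as in the HFSUniStr255 layout quoted above, the str* functions do not apply: ASCII characters contain a zero byte in UTF-16 (for example 0x0041 for "A"), and the string carries a length instead of a terminator. Below is a rough sketch of a substring search over such buffers. It uses plain uint16_t rather than Apple's UniChar type, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Search a counted UTF-16 buffer for a counted UTF-16 needle.
 * Lengths are in 16-bit code units, not bytes. Returns the index of
 * the first match in code units, or -1 if there is none. */
long utf16_find(const uint16_t *hay, size_t hay_len,
                const uint16_t *needle, size_t needle_len)
{
    if (needle_len > hay_len)
        return -1L;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len * sizeof *needle) == 0)
            return (long)i;
    }
    return -1L;
}
```

Since the extract says names are stored in a decomposed form, a needle would presumably need to be decomposed the same way before a comparison like this can match accented characters.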

Did I miss something?
--
une bévue
Sep 7 '06 #14
