"Kamal R. Prasad" <ka****@acm.org> writes:
Hello,
I am using a lexer (lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first byte of a non-ASCII character will be >
0x7f in UTF-8. If I look for the same in yytext, will that
suffice? Is there some standard function that one can use to operate on
the input stream? I want my code to be locale agnostic.
Not really topical here in clc and clcm, I'm afraid. I've redirected
to comp.unix.programmer, where I believe you'll find more people able
to answer your question.
The /first/ byte of a non-ASCII character will be >= 0xC0 (in
well-formed UTF-8, 0xC2 through 0xF4). But, yeah, you should test
for the high bit. /All/ of the bytes in a multibyte character will be
greater than 0x7f; the continuation bytes fall in the range 0x80 to
0xBF. The first byte also encodes how many bytes there are, in total,
for this character.
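
To make the byte ranges concrete, here is a minimal sketch in C. The
names utf8_seq_len and has_non_ascii are my own, purely for
illustration; they are not part of lex or of any standard library. It
only classifies the first byte and does not validate the continuation
bytes that follow.

```c
#include <stddef.h>

/* Hypothetical helper: given the first byte of a sequence, return the
 * total number of bytes the UTF-8 character should occupy, or 0 if
 * this byte cannot start a well-formed sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                      /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                      /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                      /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                      /* 11110xxx */
    /* 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF
     * never appear in well-formed UTF-8. */
    return 0;
}

/* Hypothetical helper: nonzero if any byte in s[0..n-1] has the high
 * bit set, i.e. the buffer contains something other than ASCII. */
int has_non_ascii(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((unsigned char)s[i] & 0x80)
            return 1;
    }
    return 0;
}
```

In a lex action one could presumably call has_non_ascii(yytext, yyleng),
but whether that is enough depends on what the scanner does next.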
As to how this fits in with lex, I'm not really qualified to say
much. Is it sufficient to look for the high bit? It depends on what
you intend to do after you've found one. And to be locale agnostic,
you'll probably need something to convert the input from the locale's
encoding into UTF-8 before scanning.
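
One way such a conversion step might look, assuming a POSIX system
that provides iconv (this is my assumption; nothing above requires
it). The helper name to_utf8 is hypothetical. The sketch converts one
whole buffer in a single call and does not handle input that arrives
in pieces or encodings that carry shift state.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in`, encoded in
 * from_codeset, into UTF-8 in `out` (capacity outcap bytes).
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t to_utf8(const char *from_codeset, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;             /* conversion not supported */

    /* iconv takes char ** for the input but does not modify the
     * bytes it points to, so the cast is safe here. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;             /* bad input or out too small */
    return outcap - outleft;
}
```

To pick up the user's encoding, one would typically call
setlocale(LC_CTYPE, "") and pass nl_langinfo(CODESET) from
<langinfo.h> as from_codeset, then feed the converted buffer to the
scanner.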
--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/