scanning UTF-8 characters

Hello,

I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first byte of a non-ASCII character will be >
0x7f in UTF-8. If I look for that in yytext, will that
suffice? Is there some standard function that one can use to operate on the
input stream? I want my code to be locale-agnostic.

thanks
-kamal
--
comp.lang.c.moderated - moderation address: cl**@plethora.net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry.
Apr 2 '06 #1
On 02 Apr 2006 09:47:07 GMT, "Kamal R. Prasad" <ka****@acm.org> wrote
in comp.lang.c:
> Hello,
>
> I am using a lexer (a lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first byte of a non-ASCII character will be >
> 0x7f in UTF-8. If I look for that in yytext, will that
> suffice? Is there some standard function that one can use to operate on the
> input stream? I want my code to be locale-agnostic.
>
> thanks
> -kamal


Neither lex nor UTF-8 is defined by the C language. Information on
UTF-8 can be obtained from http://www.unicode.org. Questions about
lex can be asked in news:comp.unix.programmer.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
Apr 6 '06 #2
"Kamal R. Prasad" <ka****@acm.org > writes:
> Hello,
>
> I am using a lexer (a lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first byte of a non-ASCII character will be >
> 0x7f in UTF-8. If I look for that in yytext, will that
> suffice? Is there some standard function that one can use to operate on the
> input stream? I want my code to be locale-agnostic.


Not really topical here in clc and clcm, I'm afraid. I've redirected
to comp.unix.programmer, where I believe you'll find more people able
to answer your question.

The /first/ byte of a non-ASCII character will be >= 0xC0 (in valid
UTF-8, 0xC2 through 0xF4). But, yeah, you should test for the high
bit. /All/ of the bytes in a multibyte character will be greater than
0x7f; the bytes after the first fall in the range 0x80 to 0xBF. The
first byte also encodes how many bytes there are, in total, for this
character.
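
To make the lead-byte rule concrete, here is a minimal sketch in C. The names utf8_seq_len and utf8_is_continuation are made up for illustration and come from neither lex nor the C library; the ranges assume UTF-8 as restricted to U+10FFFF, so 0xC0, 0xC1 and 0xF5 to 0xFF are rejected as lead bytes.

```c
#include <stddef.h>

/* Hypothetical helper (not part of lex or the C library): returns the
 * total number of bytes in the UTF-8 sequence introduced by lead byte b,
 * or 0 if b cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;               /* 0xxxxxxx: plain ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* 110xxxxx; 0xC0/0xC1 would be overlong */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* 11110xxx, capped at U+10FFFF */
    return 0;                             /* 0x80-0xBF, 0xC0, 0xC1, 0xF5-0xFF */
}

/* Nonzero for continuation bytes, which have the form 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Note that yytext is a char buffer, so each byte should be cast to unsigned char before it is passed to either function; on platforms where plain char is signed, bytes above 0x7f would otherwise compare as negative.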

As to how this fits in with lex, I'm not really qualified to say
much. Is it sufficient to look for the high bit? It depends on what
you intend to do after you've found one. And to be locale-agnostic,
you'll probably need something to convert the locale's encoding into
UTF-8 before scanning.

--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Apr 6 '06 #3
I don't know whether a lexer (I have flex at hand) can do anything to
identify UTF-8 characters; I'm afraid you will have to do that job in your
own code.

"Kamal R. Prasad" <ka****@acm.org > wrote in message
news:cl****************@plethora.net...
> Hello,
>
> I am using a lexer (a lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first byte of a non-ASCII character will be >
> 0x7f in UTF-8. If I look for that in yytext, will that
> suffice? Is there some standard function that one can use to operate on the
> input stream? I want my code to be locale-agnostic.
>
> thanks
> -kamal
Apr 6 '06 #4
["Followup-To:" header set to comp.lang.c.moderated.]
On 2006-04-02, Kamal R. Prasad <ka****@acm.org> wrote:
> Hello,
> I am using a lexer (a lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first byte of a non-ASCII character will be >
> 0x7f in UTF-8. If I look for that in yytext, will that
> suffice?

In most cases. There's a thing called windowing that can, IIRC, substitute
other symbols into the 0x00 to 0x7f range, but windowing is not part of
UTF-8: in UTF-8 itself, a byte in the 0x00 to 0x7f range always stands for
the ASCII character and never occurs inside a multibyte sequence.

> Is there some standard function that one can use to operate on the
> input stream? I want my code to be locale-agnostic.


If you treat bytes above 0x7f as if they were ordinary letters and make
no assumptions about word length or display width, you should be fairly safe.
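
As a rough illustration of that approach, the sketch below scans a "word" byte by byte and counts every byte above 0x7f as a letter. The names is_word_byte and word_len are hypothetical, and the explicit ranges assume an ASCII-compatible execution character set; they are used in place of isalnum() so that the result does not depend on the current locale.

```c
#include <stddef.h>

/* Hypothetical classifier: ASCII letters, digits and '_' are word bytes,
 * and so is every byte >= 0x80 (any byte of a multibyte UTF-8 character). */
static int is_word_byte(unsigned char b)
{
    return b >= 0x80
        || (b >= '0' && b <= '9')
        || (b >= 'A' && b <= 'Z')
        || (b >= 'a' && b <= 'z')
        || b == '_';
}

/* Returns the length in BYTES (not characters) of the word at the start of s. */
size_t word_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_word_byte((unsigned char)s[n]))
        n++;
    return n;
}
```

Because every byte of a multibyte character is above 0x7f, a token boundary found this way never falls in the middle of a character. In a lex specification, the same idea would presumably be expressed as a character class that also covers bytes 0x80 to 0xff, provided the generated scanner is 8-bit clean.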

If you're hoping to identify digits and punctuation in unusual scripts
(Chinese, Sinhala, Sanskrit, Klingon, etc.), you'll need to convert your
UTF-8 stream to Unicode code points and pass them to the lexer.
Bye.
Jasen
Apr 6 '06 #5
"Kamal R. Prasad" wrote:
> I am using a lexer (a lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first byte of a non-ASCII character will be >
> 0x7f in UTF-8. If I look for that in yytext, will that
> suffice? Is there some standard function that one can use to operate on the
> input stream? I want my code to be locale-agnostic.


You need to check that your version of "lex" supports wide characters,
which most do not. Otherwise you have to lex every possible character
into a token, which is almost certainly not what you want to do.

In most situations, it is easier to hand-code a lexer than to use "lex",
and this is a case where that is even more likely to be true.

Convert the UTF-8 to Unicode code points (the original UTF-8 design allowed
values of up to 31 bits; RFC 3629 restricts it to U+10FFFF) and handle
characters solely as "wide" characters throughout.
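
A hand-written conversion step might look something like the following sketch. The function utf8_decode is a hypothetical name, not a standard library routine; it assumes the RFC 3629 limit of U+10FFFF and rejects overlong forms and surrogates, which a stricter or more lenient decoder might treat differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one code point from the n bytes at s.
 * On success, stores it in *out and returns the number of bytes consumed
 * (1 to 4). Returns 0 on malformed, truncated, overlong, surrogate or
 * out-of-range input, leaving *out untouched. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min_for_len[len]) return 0;        /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogates */
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

A hand-coded lexer would call this in a loop, advancing by the returned length and classifying each uint32_t value, rather than classifying individual bytes.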
Apr 6 '06 #6
