Hello,
I am using a lexer (a lex specification supplied to lex) to parse data,
and one of the requirements is to handle UTF-8 characters. My
understanding is that the first byte of any non-ASCII character will be >
0x7f in UTF-8. If I look for such a byte in yytext, will that
suffice? Is there some standard function that one can use to operate on the
input stream? I want my code to be locale-agnostic.
thanks
-kamal
--
comp.lang.c.moderated - moderation address: cl**@plethora.net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry.
On 02 Apr 2006 09:47:07 GMT, "Kamal R. Prasad" <ka****@acm.org> wrote
in comp.lang.c:
> Hello,
> I am using a lexer (lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first non-ASCII character's byte will be >
> 0x7f in a UTF-8 character. If I look for the same in yytext, will that
> suffice? Is there some std function that one can use to operate on the
> input stream? I want my code to be locale agnostic.
> thanks
> -kamal
Neither lex nor UTF-8 is defined by the C language. Information on
UTF-8 can be obtained from http://www.unicode.org. Questions about
lex can be asked in news:comp.unix.programmer.
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
"Kamal R. Prasad" <ka****@acm.org> writes:
> Hello,
> I am using a lexer (lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first non-ASCII character's byte will be >
> 0x7f in a UTF-8 character. If I look for the same in yytext, will that
> suffice? Is there some std function that one can use to operate on the
> input stream? I want my code to be locale agnostic.
Not really topical here in clc and clcm, I'm afraid. I've redirected
to comp.unix.programmer, where I believe you'll find more people able
to answer your question.
The /first/ byte of a non-ASCII character will be >= 0xC0 (0xC2 through
0xF4 in well-formed UTF-8). But, yeah, you should test for the high
bit. /All/ of the bytes in a multibyte character will be greater than
0x7f; the continuation bytes fall in 0x80 through 0xBF. The first byte
also encodes how many bytes there are, in total, for this character.
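To make the byte ranges concrete, here is a minimal sketch in C. The function names utf8_seq_len and has_non_ascii are made up for this illustration and are not part of lex or any standard library. The ranges assume well-formed UTF-8 as currently defined (lead bytes 0xC2 through 0xF4), so 0xC0, 0xC1 and 0xF5 through 0xFF are treated as invalid.

```c
/* Hypothetical helper: returns the total length in bytes of the UTF-8
 * sequence that starts with lead byte b, or 0 if b cannot start a
 * sequence (a continuation byte 0x80..0xBF, or a byte that never
 * appears in well-formed UTF-8). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                    /* 0xxxxxxx: plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                    /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                    /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                    /* 11110xxx */
    return 0;
}

/* Hypothetical helper: returns 1 if any byte of the NUL-terminated
 * string s has the high bit set, 0 otherwise. */
int has_non_ascii(const char *s)
{
    for (; *s != '\0'; ++s)
        if ((unsigned char)*s > 0x7F)
            return 1;
    return 0;
}
```

In a lex action one could presumably call has_non_ascii(yytext) to decide whether a token needs further UTF-8 handling. The cast to unsigned char matters because plain char may be signed, in which case high-bit bytes would compare as negative.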
As to how this fits in with lex, I'm not really qualified to say
much. Is it sufficient to look for the high bit? It depends on what
you intend to do after you've found one. And to be locale agnostic,
you'll probably need something to convert the locale's encoding into
UTF-8 before scanning.
--
HTH,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
I don't know whether a lexer (I have flex at hand) can do anything to
identify UTF-8 characters; I'm afraid you will have to do that job in
your own code.
"Kamal R. Prasad" <ka****@acm.org> wrote in message
news:cl****************@plethora.net...
> Hello,
> I am using a lexer (lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first non-ASCII character's byte will be >
> 0x7f in a UTF-8 character. If I look for the same in yytext, will that
> suffice? Is there some std function that one can use to operate on the
> input stream? I want my code to be locale agnostic.
> thanks
> -kamal
["Followup-To:" header set to comp.lang.c.moderated.]
On 2006-04-02, Kamal R. Prasad <ka****@acm.org> wrote:
> Hello,
> I am using a lexer (lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first non-ASCII character's byte will be >
> 0x7f in a UTF-8 character. If I look for the same in yytext, will that
> suffice?
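In most cases, yes. There's a thing called windowing that can, IIRC,
substitute other symbols into the 0x00 to 0x7f range, but I believe
that belongs to SCSU rather than to UTF-8; in UTF-8 itself, bytes 0x00
to 0x7f always stand for plain ASCII.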
> Is there some std function that one can use to operate on the input
> stream? I want my code to be locale agnostic.
If you treat bytes above 0x7f as if they were ordinary letters and make
no assumptions about word length or display width, you should be fairly
safe. If you're hoping to identify digits and punctuation in unusual
scripts (Chinese, Sinhala, Sanskrit, Klingon etc.), you'll need to
convert your UTF-8 stream to Unicode code points and pass them to the
lexer.
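The "treat high bytes as letters" approach can be sketched in C roughly as follows. The names is_word_byte and word_len are invented for this example, and the definition of a "word" (ASCII letters, digits, underscore, plus any byte with the high bit set) is only an assumption about what the scanner should accept. The ASCII ranges are spelled out by hand instead of using isalnum(), so the result does not depend on the current locale.

```c
#include <stddef.h>

/* Hypothetical predicate: a byte may appear in a word if it is an ASCII
 * letter, digit or underscore, or if it has the high bit set (that is,
 * it belongs to some multibyte UTF-8 character). Assumes an
 * ASCII-compatible execution character set. */
int is_word_byte(unsigned char b)
{
    return b >= 0x80
        || (b >= '0' && b <= '9')
        || (b >= 'A' && b <= 'Z')
        || (b >= 'a' && b <= 'z')
        || b == '_';
}

/* Hypothetical scanner step: returns the length in bytes of the word at
 * the start of the NUL-terminated string s, or 0 if s does not start
 * with a word byte. */
size_t word_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_word_byte((unsigned char)s[n]))
        ++n;
    return n;
}
```

The returned length counts bytes, not characters, which is why no assumption about word length or display width should be built on it. In flex, a character class along the lines of [A-Za-z0-9_\x80-\xff] should express the same idea, though whether a traditional lex accepts 8-bit input that way would need checking.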
Bye.
Jasen
"Kamal R. Prasad" wrote:
> I am using a lexer (lex specification supplied to lex) to parse data,
> and one of the requirements is to handle UTF-8 characters. My
> understanding is that the first non-ASCII character's byte will be >
> 0x7f in a UTF-8 character. If I look for the same in yytext, will that
> suffice? Is there some std function that one can use to operate on the
> input stream? I want my code to be locale agnostic.
You need to check that your version of "lex" supports wide characters,
which most do not. Otherwise you have to lex every possible character
into a token, which is almost certainly not what you want to do.
In most situations it is easier to hand-code a lexer than to use "lex",
and that is even more likely to be true here.
Convert the UTF-8 to Unicode code points (31 bits wide in the original
UTF-8 design, at most U+10FFFF since RFC 3629) and handle characters
solely as "wide" characters throughout.
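A hand-coded conversion from UTF-8 to code points might look like the sketch below. The function name utf8_decode and its interface are made up for this example. It assumes the RFC 3629 form of UTF-8 (at most four bytes, nothing above U+10FFFF) rather than the older 31-bit form, and it rejects overlong encodings and surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: decodes one UTF-8 sequence from s, of which at
 * most n bytes are available. On success stores the code point in *out
 * and returns the number of bytes consumed (1..4). Returns 0 on
 * malformed, truncated, overlong, surrogate or out-of-range input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs a sequence of the
     * given length; anything smaller is an overlong encoding. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else
        return 0;                   /* continuation byte or invalid lead */

    if (len > n)
        return 0;                   /* sequence is truncated */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a 10xxxxxx continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min_cp[len])
        return 0;                   /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                   /* UTF-16 surrogate range */
    if (cp > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */
    *out = cp;
    return len;
}
```

A hand-written lexer would call this in a loop, advancing by the returned length and classifying each code point itself. Storing code points in uint32_t instead of wchar_t is a deliberate choice here, since wchar_t is only 16 bits wide on some platforms.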