Reading text file containing accented vowels

Phil Slater

I'm trying to process a collection of text files, reading word by word. The
program hangs whenever it encounters a word with an accented letter
(like rôle or passé), i.e. something that's not a "char" with an ASCII code
in 0..127.

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to work around this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil

Jul 22 '05 #1

Howard

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...

I'm trying to process a collection of text files, reading word by word. The program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil

It would help greatly if we could see the code that's doing the reading now.
There's nothing special about "accented vowels" that would prevent reading
them, except that their values are not in the range of 0..127. If you read
unsigned char values (instead of char), then you can read anything in the
range 0..255. (Maybe that's the problem?) Using a stream and reading into
a string should work, and then you can parse each line word-by-word. But
again, I can't tell where your code *might* be stuck without seeing the
code.

-Howard

Jul 22 '05 #2

Victor Bazarov

Phil Slater wrote:

I'm trying to process a collection of text files, reading word by word. The
program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.

It's possible that your accented characters are from the "extended ASCII"
part of the code table. Try reading the words using _unsigned_char_ type.

V

Jul 22 '05 #3

John Harrison

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...

I'm trying to process a collection of text files, reading word by word. The program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.

I would guess that somewhere you are assuming that char values are
positive, for instance by using a char variable as the index of an array.
This is not necessarily true of non-ASCII characters which can have negative
values (depending on your implementation). Casts to unsigned char at
appropriate places in your code might solve this.
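As an illustration of the kind of cast meant here, a minimal sketch: it assumes 8-bit chars and a hypothetical helper named count_bytes that uses each character as an array index. Neither the name nor the counting task comes from the original code, which was not posted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Count how often each byte value occurs in `text`.
// Where plain char is signed, a non-ASCII byte used directly as an index
// could be negative, so each char is cast to unsigned char first.
// Assumption: char is 8 bits wide, so 256 slots cover every value.
std::vector<std::size_t> count_bytes(const std::string& text)
{
    std::vector<std::size_t> counts(256, 0);
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const unsigned char index = static_cast<unsigned char>(text[i]);
        ++counts[index];
    }
    return counts;
}
```

With the cast, a byte such as 0xE9 (é in Latin-1) indexes slot 233 rather than a negative offset. The same cast is needed before passing a char to the <cctype> functions such as std::isalpha.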

I would like to be more specific but you forgot to include any code at all
in your post.

john

Jul 22 '05 #4

Phil Slater

Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
... ad infinitum

Looks like the stream goes into a fail state when it hits the á.

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

Thanks for your help.

Phil

Jul 22 '05 #5

John Harrison

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...

Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (f.eof())
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...
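Spelled out in full, the second form might look like the sketch below. The function names read_words and split_words are made up for illustration, and std::vector is used only to hold the result. Because the extraction itself is the loop condition, the loop also stops if the stream enters a fail state before end of file, which a bare eof() test does not catch.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect whitespace-separated words from any input stream.
// The loop body runs only after a word has actually been extracted.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
    {
        words.push_back(word);
    }
    return words;
}

// Convenience wrapper for trying the loop on in-memory text.
std::vector<std::string> split_words(const std::string& text)
{
    std::istringstream in(text);
    return read_words(in);
}
```

For a file, the same read_words function can be handed an std::ifstream opened on "test.txt".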

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL; that works fine for me.
Which compiler are you using?

typedef basic_string<unsigned char> ustring;
int main(void)
{
ustring word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

That should simply not compile; the char type of the string is not the same
as the char type of the fstream.

Time to upgrade your compiler I think.
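One way around the mismatch, sketched under the assumption that the file is in a single-byte encoding: keep the stream and the string both on char for the read, and convert to unsigned char only afterwards. The helper name to_bytes is hypothetical.

```cpp
#include <string>
#include <vector>

// Convert a word that was read as std::string into unsigned bytes (0..255).
// The stream and the string share the same character type (char) during
// the read; the conversion to unsigned char happens only afterwards.
std::vector<unsigned char> to_bytes(const std::string& word)
{
    return std::vector<unsigned char>(word.begin(), word.end());
}
```

The word itself would still be read with the ordinary f >> word on a std::string, so the extraction operator sees matching character types.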

john

Jul 22 '05 #6

Phil Slater

John Harrison wrote:

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (f.eof())
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...

Thanks for that.

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.

Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Which compiler are you using?

g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

That should simply not compile; the char type of the string is not the same
as the char type of the fstream.

Time to upgrade your compiler I think.

john

Jul 22 '05 #7

John Harrison

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.

Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Yes.

Which compiler are you using?

g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

I've heard that gcc 2.95 has a poor implementation of the standard template
library (STL), and your experience seems to prove it. In my last post I tried
with VC++ 7.1; I've just tried with gcc 3.3.1 and got the same result. Your
first program runs correctly, your second doesn't compile. I really think
you are going to have to upgrade.

john

Jul 22 '05 #8

Howard

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...

Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

That's correct. You're reading in a string. There is nothing that says
that the read is delimited by anything.

So what do you do? I'd probably read in a line at a time into a string (and
then parse the string into words, if there can be more than one word on a
line). I think std::getline is the function for reading a line.

Also, don't use while (!f.eof()), use while (getline(whatever)). The eof()
function is not valid to check until *after* attempting a read.
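A rough sketch of that approach, assuming words are separated by whitespace. The function name words_by_line is invented for the example; std::getline and std::istringstream are the standard library pieces doing the work.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read `in` one line at a time and split each line into words.
// std::getline is the loop condition, so a failed read ends the loop
// instead of being processed as another (empty) line.
std::vector<std::string> words_by_line(std::istream& in)
{
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line))
    {
        // Parse the words of this one line from an in-memory stream.
        std::istringstream line_stream(line);
        std::string word;
        while (line_stream >> word)
        {
            words.push_back(word);
        }
    }
    return words;
}
```

To use it on a file, open an std::ifstream on the file (for example "test.txt") and pass that stream to words_by_line.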

-Howard

Jul 22 '05 #9
