Reading text file containing accented vowels

I'm trying to process a collection of text files, reading word by word. The
program hangs whenever it encounters a word with an accented letter
(like rôle or passé), i.e. something that's not a "char" with an ASCII code
in 0..127.

I've searched the ANSI C++ standard, the internet and various textbooks,
but can't see how to work around this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil
Jul 22 '05 #1

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...
I'm trying to process a collection of text files, reading word by word. The program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil

It would help greatly if we could see the code that's doing the reading now.
There's nothing special about "accented vowels" that would prevent reading
them, except that their values are not in the range of 0..127. If you read
unsigned char values (instead of char), then you can read anything in the
range 0..255. (Maybe that's the problem?) Using a stream and reading into
a string should work, and then you can parse each line word-by-word. But
again, I can't tell where your code *might* be stuck without seeing the
code.
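
A minimal sketch of the point about unsigned values, assuming the file uses a single-byte encoding such as Latin-1 (the encoding is a guess, nothing in the post says so). The helper name byte_values is made up for illustration; it reads nothing itself and only shows how each byte of an already-read std::string looks once it is viewed as unsigned char:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: return the numeric value of every byte in `word`.
// Going through unsigned char keeps bytes above 127 from turning into
// negative numbers on platforms where plain char is signed.
std::vector<int> byte_values(const std::string& word)
{
    std::vector<int> values;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        int value = static_cast<unsigned char>(word[i]);
        assert(value >= 0);   // never negative after the cast
        values.push_back(value);
    }
    return values;
}
```

Called on the two bytes 'K' and 0xE1 (á in Latin-1), this gives 75 and 225 on an ASCII-based system, whether or not plain char is signed.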

-Howard

Jul 22 '05 #2
Phil Slater wrote:
I'm trying to process a collection of text files, reading word by word. The
program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.


It's possible that your accented characters are from the "extended ASCII"
part of the code table. Try reading the words using _unsigned_char_ type.

V
Jul 22 '05 #3

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...
I'm trying to process a collection of text files, reading word by word. The program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.


I would guess that somewhere you are assuming that char values are
positive, for instance by using a char variable as the index of an array.
This is not necessarily true of non-ASCII characters, which can have negative
values (depending on your implementation). Casts to unsigned char at
appropriate places in your code might solve this.
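
To illustrate the kind of place meant here, a small sketch of a byte-frequency table. The function name count_bytes is hypothetical and not taken from anyone's code; the only point is the cast just before the array index:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical example: count how often each byte value occurs in `text`.
std::vector<std::size_t> count_bytes(const std::string& text)
{
    // One slot per possible unsigned char value, all starting at zero.
    std::vector<std::size_t> counts(UCHAR_MAX + 1);
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // Without the cast, a byte such as 0xE1 may be a negative char
        // and index outside the table, which is undefined behaviour.
        std::size_t index = static_cast<unsigned char>(text[i]);
        assert(index < counts.size());
        ++counts[index];
    }
    return counts;
}
```

Writing counts[text[i]] instead would appear to work for plain ASCII input and only go wrong once an accented letter turns up, which would match the symptom described.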

I would like to be more specific but you forgot to include any code at all
in your post.

john
Jul 22 '05 #4
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
.... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

Thanks for your help.

Phil
Jul 22 '05 #5

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (!f)     // stop on any failed read, not just end of file
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...
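
Spelled out a little further, the shorter loop could look like the sketch below. It takes any std::istream so it can be tried on an in-memory string as well as on a file; the names read_words and check_read_words are invented for this example, and the accented letters are written as Latin-1 bytes, which is an assumption about the file's encoding:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect whitespace-separated words until an extraction fails.
// The loop condition tests the stream *after* each read, so it ends at
// end of file and also on any other failure, instead of looping forever.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}

// Self-check on an in-memory copy of the three-word test file.
void check_read_words()
{
    std::istringstream in("bungle\n\nGruntfutter\n\nK\xE1rahnj\xFAkar\n");
    std::vector<std::string> words = read_words(in);
    assert(words.size() == 3);
    assert(words[2] == "K\xE1rahnj\xFAkar");
}
```

For a real file the same function can be handed an ifstream opened on "test.txt".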

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.
Which compiler are you using?

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}


That should simply not compile; the char type of the string is not the same
as the char type of the fstream.

Time to upgrade your compiler I think.

john
Jul 22 '05 #6


John Harrison wrote:
"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
    f >> word;
    if (!f)     // stop on any failed read, not just end of file
        break;
    cout << ...
}

or more simply

while (f >> word)
    cout << ...


Thanks for that.

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.


Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Which compiler are you using?

g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

That should simply not compile; the char type of the string is not the same
as the char type of the fstream.

Time to upgrade your compiler I think.


Which compiler are you using?

john


Jul 22 '05 #7
Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.

Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Yes.

Which compiler are you using?


g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)


I've heard that gcc 2.95 has a poor implementation of the standard template
library (STL), and your experience seems to prove it. Last post I tried
with VC++ 7.1; I've just tried with gcc 3.3.1 and got the same result. Your
first program runs correctly, your second doesn't compile. I really think
you are going to have to upgrade.

john
Jul 22 '05 #8

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
    string word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
    ustring word;
    ifstream f("test.txt");
    while(!f.eof())
    {
        f >> word;
        cout << '<' << word << '>' << endl;
    }
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?


That's correct. You're reading in a string, and nothing says that the read
is delimited by anything.

So what do you do? I'd probably read in a line at a time into a string (and
then parse the string into words, if there can be more than one word on a
line). I think std::getline is the function for reading a line.

Also, don't use while (!f.eof()), use while (getline(whatever)). The eof()
function is not valid to check until *after* attempting a read.
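
Roughly what I have in mind, as a sketch rather than tested production code: std::getline pulls in one line at a time and a std::istringstream splits that line into words. The function names are invented for the example, and the accented letters are given as Latin-1 bytes, which is an assumption about how the file is encoded:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the input line by line, then split each line into words.
std::vector<std::string> read_words_by_line(std::istream& in)
{
    std::vector<std::string> words;
    std::string line;
    while (std::getline(in, line)) {    // false once a read fails
        std::istringstream fields(line);
        std::string word;
        while (fields >> word)          // whitespace-separated words
            words.push_back(word);
    }
    return words;
}

// Self-check: two words on one line, a blank line, and a last line
// that has no trailing newline.
void check_read_words_by_line()
{
    std::istringstream in("bungle Gruntfutter\n\nK\xE1rahnj\xFAkar");
    std::vector<std::string> words = read_words_by_line(in);
    assert(words.size() == 3);
    assert(words[0] == "bungle");
    assert(words[2] == "K\xE1rahnj\xFAkar");
}
```

Blank lines simply contribute no words, and a final line without a newline is still read.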

-Howard


Jul 22 '05 #9
